Abstract: We investigate the asymptotic normality of the posterior distribution
in the discrete setting, when model dimension increases with sample size. We
consider a probability mass function $\theta_0$ on $\mathbb{N} \setminus \{0\}$
and a sequence of truncation levels $(k_n)_n$ satisfying
$k_n^3 \le n \inf_{i \le k_n} \theta_0(i)$. Let $\hat\theta_n$ denote the maximum
likelihood estimate of $(\theta_0(i))_{i \le k_n}$ and let $\Delta_n(\theta_0)$
denote the $k_n$-dimensional vector whose $i$-th coordinate is
$\sqrt{n}\,(\hat\theta_n(i) - \theta_0(i))$ for $1 \le i \le k_n$. We check that,
under mild conditions on $\theta_0$ and on the sequence of prior probabilities
on the $k_n$-dimensional simplices, the variation distance between the posterior
distribution recentered around $\hat\theta_n$ and rescaled by $\sqrt{n}$ and the
$k_n$-dimensional Gaussian distribution
$\mathcal{N}(\Delta_n(\theta_0), I^{-1}(\theta_0))$ converges in probability
to 0. This theorem can be used to prove the asymptotic normality of Bayesian
estimators of Shannon and Rényi entropies.

The proofs are based on concentration inequalities for centered and non-centered
chi-square (Pearson) statistics. The latter make it possible to establish
posterior concentration rates with respect to Fisher distance rather than with
respect to the Hellinger distance, as is commonplace in non-parametric Bayesian
statistics.

1. Introduction

The classical Bernstein-Von Mises Theorem asserts that for regular (Hellinger
differentiable) parametric models, under mild smoothness conditions on the 
prior distribution, after centering around the maximum likelihood estimate and 
rescaling, the posterior distribution of the parameter is asymptotically Gaus- 
sian and that the limiting covariance matrix coincides with the inverse of the 
Fisher information matrix. This theorem provides a frequentist perspective on 
the Bayesian methodology and elements for reconciliation of the two approaches. 
In regular parametric models, Bernstein-von Mises theorems motivate the
interchange of Bayesian credible sets and frequentist confidence regions.
Refinements of the Bernstein-von Mises theorem have also proved helpful when
analyzing the
redundancy of universal coding for smoothly parametrized classes of sources over 
finite alphabets. 

The proof of the classical Bernstein-Von Mises theorem relies on rather so- 
phisticated arguments. Some of them seem to be tied up with the finite dimen- 
sionality of the considered models. Hence, extensions of Bernstein-von Mises 
theorems to non-parametric and semi-parametric settings have both received 
deserved attention and shown moderate progress during the last four decades. 
Soon after Bayesian inference was put on firm frequentist foundations by Doob 
(1949), Schwartz (1965) and others, Freedman (1963) (see also Freedman, 1965)
pointed out that even when dealing with the simplest possible case, that of in- 
dependent, identically distributed, discrete observations, there is no such thing 
as a general posterior consistency result let alone a general Bernstein-Von Mises 
Theorem. Moreover, according to the evidence presented by Freedman (1965),
it is mandatory to focus on moderately large classes of distributions. Despite
such early negative results, non-parametric Bayesian theory has been progressing
at a steady pace. The framework of empirical process theory has made it possible
to provide sufficient conditions for posterior consistency and to relate
posterior concentration rates to model complexity (Ghosal and van der Vaart,
2007b, 2001; Ghosal et al., 2000).

Among the different approaches to non-parametric inference, using simple 
models with increasing dimensions has attracted attention in the context of 
maximum likelihood inference (Portnoy, 1988; Fan and Truong, 1993; Fan et al., 
2001; Fan, 1993) and in the context of Bayesian inference (Ghosal, 2000). The
last reference is especially relevant to this paper. Therein, S. Ghosal consid- 
ers nested sequences of exponential models satisfying a number of assumptions 
involving the growth rate of models with sample size, the growth rate of the 
determinant of the Fisher information matrix with respect to model dimension 
(and thus sample size), prior smoothness, and moment bounds for score functions
in small Kullback-Leibler balls located around the sampling probability
(those conditions will be explained and compared with our own conditions in 
Section 3.1). S. Ghosal proves a Bernstein-Von Mises Theorem (Ghosal, 2000, 
Theorem 2.3) for the log-odds parametrization, partially building on previous 
results from Portnoy (1988) concerning maximum likelihood estimates. However 
our objectives significantly differ from those of S. Ghosal. In (Ghosal, 2000), the 
main application of non-parametric Bernstein-Von Mises Theorems for multino- 
mial models seems to be non-parametric density estimation using histograms. 
This framework justifies special attention to multinomial distributions which are 
almost uniform. Our ultimate goal is quite different. In information-theoretical 
language, we are interested in investigating memoryless sources over infinite
alphabets (see Kieffer, 1978; Györfi et al., 1993; Boucheron et al., 2009, and
references therein). In Information Theory, refinements of Bernstein-Von Mises
Theorems make it possible to investigate the so-called maximin redundancy of
universal coding over parametric classes of sources (Clarke and Barron, 1994).
In Information Theory, a source over a (countable) alphabet is a probability
distribution
over the set of infinite sequences of symbols from the alphabet. The redundancy 
of a (coding) probability distribution with respect to a source on a given (finite) 
sequence of symbols is the logarithm of the ratio between the probability of the 
sequence under the source and under the coding probability. In universal coding 
theory, average redundancy with respect to a prior distribution over sources can 
be written as the difference between the (differential) Shannon entropy of the 
prior distribution and the average value of the (differential) entropy of the condi- 
tional posterior distribution. Thanks to non-trivial refinements of the Bernstein- 
Von Mises Theorem, the latter conditional entropy can be approximated by the 
(differential) entropy of a Gaussian distribution whose covariance matrix is the
inverse of the Fisher information matrix defined by the source under consid- 
eration. This elegant approach provides sharp asymptotic and non-asymptotic 
results when dealing with classes of sources which are soundly parameterized by 
subsets of finite-dimensional spaces (See Clarke and Barron, 1990, for precise 
definitions). When turning to larger classes of sources, for example toward mem- 
oryless sources over countable alphabets (Boucheron et al., 2009), this approach 
to the characterization of maximin redundancy has not (yet) been carried out.
A major impediment is the current unavailability of adequate non-parametric
Bernstein-Von Mises Theorems. 

This paper is a first step in developing the Bayesian tools that are use- 
ful to precisely quantify the minimax redundancy of universal coding of non- 
parametric classes of sources over infinite alphabets. Because of our ultimate 
goals, we cannot focus on almost uniform multinomial models. We are specifi- 
cally interested in situations where the sampling probability mass functions de- 
cay at a prescribed rate (say algebraic or exponential) as in (Boucheron et al., 
2009). 

As pointed out by Ghosal, in models with an increasing number of pa- 
rameters, justifying the asymptotic normality of the posterior distribution is 
more involved, and precisely characterizing under which conditions on prior 
and sampling distribution this asymptotic normality holds remains an open- 
ended question. For example, in the context of discrete distributions, several 
ways of defining the divergence between distributions look reasonable. Most of 
the recent work on non-parametric Bayesian statistics dealt with posterior
concentration rates and has been developed using Hellinger distance (Ghosal et al.,
2000; Ghosal and van der Vaart, 2007b, 2001). One may wonder whether some 
posterior concentration rate results obtained using Hellinger metrization can 
be strengthened. It is not clear how to tackle this issue in full generality. In 
this paper, taking advantage of the peculiarities of our models, we use another, 
demonstrably stronger, information divergence, the Fisher ($\chi^2$) "distance" and
establish posterior concentration rates with respect to Fisher balls (see
Lemma 3.6). The proof relies on known concentration inequalities for centered
$\chi^2$ (Pearson) statistics and (apparently) new concentration inequalities
for non-centered $\chi^2$ statistics.

Paraphrasing van der Vaart (1998), as the notion of convergence in the 
Bernstein-Von Mises Theorem is a rather complicated one, the expected reward,
once such a Theorem has been proved, is that "nice" functionals applied
to the posterior laws should converge in distribution in the usual sense. An 
obvious candidate for deriving that kind of method is a Bayesian variation on 
the Delta method. However, we are facing here two kinds of obstacles. On the 
one hand, we cannot rely on the availability of a Bernstein-Von Mises Theorem
when considering the infinite-dimensional model (Freedman, 1963, 1965).
This precludes using the traditional functional Delta method as described for 
example in (van der Vaart and Wellner, 1996; van der Vaart, 1998). On the 
other hand, when considering models of increasing dimensions, a variant of 
the Delta method has to be derived in an ad hoc manner. This is what we do. 
We assess this rule of thumb by examining plug-in estimates of Shannon and 
Renyi entropies. Such functionals characterize the compressibility of a given 
probability distribution (Csiszár and Körner, 1981; Cover and Thomas, 1991;
Gallager, 1968). The problem of estimating such functionals has been
investigated by Antos and Kontoyiannis (2001) and Paninski (2004). It has been
checked there that plug-in estimates of the Shannon and Renyi entropies are 
consistent and some lower and upper bounds on the rate of convergence have 
been proposed. To our knowledge, classes of distributions for which plug-in
estimates satisfy a central limit theorem have not been systematically
characterized. Here, the Bernstein-Von Mises Theorem makes it possible to derive
central limit theorems for Bayesian entropy estimators (see Theorem 3.12) and
provides the basis
for constructing Bayesian credible sets. In the present context, those credible 
sets are known to coincide asymptotically with Bayesian bootstrap confidence 
regions (Rubin, 1981). 

The paper is organized as follows. In Section 2, the framework and nota- 
tion of the paper are introduced. A few technical conditions warranting local 
asymptotic normality when handling models of increasing dimensions are also 
stated. The main results of the paper are presented in Section 3. The non- 
parametric Bernstein-Von Mises Theorem (3.7) is described in Subsection 3.1. 
It is complemented by a posterior concentration lemma (3.6) that might be in- 
teresting in its own right. A roadmap of the proof of the Bernstein-Von Mises 
Theorem is stated thereafter. In Paragraph 3.2, the asymptotic normality of 
Bayesian estimators of various entropies is derived using the non-parametric 
Bernstein-Von Mises Theorem and various tail bounds for quadratic forms that 
are also useful in the derivation of the Bernstein-Von Mises theorem. In Para- 
graph 3.3, sequences of Dirichlet priors are checked to satisfy the conditions 
of the Bernstein-Von Mises Theorem. The main results of the paper are
illustrated on the envelope classes investigated by Boucheron et al. (2009). In
Subsection 3.5, the setting of Theorem 3.7 is compared with the framework de- 
scribed in (Ghosal, 2000). In Subsection 3.6, the posterior concentration lemma 
is compared with related recent results in non-parametric Bayesian statistics. 
The proof of the Bernstein-Von Mises Theorem is given in Section 4. It adapts
Le Cam's proof (Le Cam and Yang, 2000; van der Vaart, 2002) to the
non-parametric setting using a collection of old and new non-asymptotic tail
bounds
for chi-square statistics. The proof of the asymptotic normality of Bayesian 
entropy estimators is given in Section 5. It relies on the Bernstein-Von Mises 
Theorem and on the aforementioned tail bounds for chi-square statistics. 



2. Notation and background 

This section describes the statistical framework we will work with, as well as 
the behavior of likelihood ratios in this framework. At the end of the section, a 
useful contiguity result is stated. 

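Throughout the paper, $\theta = (\theta(i))_{i \in \mathbb{N}_*}$ denotes a
probability mass function over $\mathbb{N}_* = \mathbb{N} \setminus \{0\}$ and
$\Theta$ denotes the set of probability mass functions over $\mathbb{N}_*$. If
the sequence $\mathbf{x} = x_1, \ldots, x_n$ denotes a sample of $n$ elements
from $\mathbb{N}_*$, let $N_i$ denote the number of occurrences of $i$ in
$\mathbf{x}$: $N_i(\mathbf{x}) = \sum_{j=1}^{n} \mathbf{1}_{x_j = i}$. The
log-likelihood function maps $\Theta \times \mathbb{N}_*^n$ toward $\mathbb{R}$:

$$\ell_n(\theta, \mathbf{x}) = \sum_{i \ge 1} N_i \log \theta(i) .$$

When the sample $\mathbf{x}$ is clear from context, $\ell_n(\theta, \mathbf{x})$
is abbreviated into $\ell_n(\theta)$.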
Throughout the paper, $\theta_0$ denotes the (unknown) probability mass function
under which samples are collected. Let $\Omega = \mathbb{N}_*^{\mathbb{N}}$, and
let $X_1, \ldots, X_n, \ldots$ denote the coordinate projections. Then
$\mathbb{P}_0$ denotes the probability distribution over $\Omega$ (equipped with
the cylinder $\sigma$-algebra $\mathcal{F}$) satisfying

$$\mathbb{P}_0\Big\{ \bigcap_{i=1}^{n} \{X_i = x_i\} \Big\} = \prod_{i=1}^{n} \theta_0(x_i) .$$

Recall that the maximum likelihood estimator $\hat\theta$ of $\theta_0$ on a
sample $\mathbf{x}$ is given by the empirical probability mass function:
$\hat\theta(i) = N_i / n$.

Let $k$ denote a positive integer that may and should depend on the sample
size $n$. We will be interested in the estimation of the $\theta_0(i)$ for
$i = 1, \ldots, k$. In this respect, all the useful information is conveyed by
the counts $N_i$, $i = 1, \ldots, k$, or equivalently by what will be called the
truncated version of the sample. The truncated version of sample $\mathbf{x}$ is
denoted by $\tilde{\mathbf{x}}$ and constructed as follows:

$$\tilde{x}_i = \begin{cases} x_i & \text{if } x_i \le k, \\ 0 & \text{otherwise.} \end{cases}$$

The counter $N_0$ is defined as the number of occurrences of 0 in
$\tilde{\mathbf{x}}$: $N_0(\mathbf{x}) = \sum_{i > k} N_i(\mathbf{x})$. The image
of $\theta \in \Theta$ by truncation is a p.m.f. over $\{0, \ldots, k\}$; it is
still denoted by $\theta$, with $\theta(0) = \sum_{i > k} \theta(i)$. Let
$\Theta_k$ denote the set of p.m.f. over $\{0, \ldots, k\}$. In the sequel,
depending on context, $\theta_0$ may denote either the p.m.f. on $\mathbb{N}_*$
from which the sample is drawn or its image by truncation at level $k$.
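
The truncation step and the empirical p.m.f. can be sketched in a few lines of Python. This is only an illustration and not part of the original argument: the function names (`truncate_sample`, `counts`, `mle`) and the toy sample are hypothetical.

```python
from collections import Counter

def truncate_sample(x, k):
    """Map every symbol larger than k to the extra symbol 0."""
    return [xi if xi <= k else 0 for xi in x]

def counts(x_trunc, k):
    """Return [N_0, N_1, ..., N_k] for a truncated sample."""
    c = Counter(x_trunc)
    return [c.get(i, 0) for i in range(k + 1)]

def mle(count_vec):
    """Empirical p.m.f. on {0, ..., k}: theta_hat(i) = N_i / n."""
    n = sum(count_vec)
    return [ni / n for ni in count_vec]

x = [1, 3, 2, 7, 1, 1, 5, 2, 9, 1]   # toy sample from N* (hypothetical data)
k = 3                                 # truncation level
x_tilde = truncate_sample(x, k)
N = counts(x_tilde, k)
theta_hat = mle(N)
print(x_tilde)    # [1, 3, 2, 0, 1, 1, 0, 2, 0, 1]
print(N)          # [3, 4, 2, 1]
print(theta_hat)  # [0.3, 0.4, 0.2, 0.1]
```

Here the three symbols 7, 5 and 9 exceed the truncation level and are all merged into the symbol 0, so that $N_0 = 3$.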

Henceforth, $\theta \in \Theta_{k_n}$ may denote either
$(\theta(i))_{0 \le i \le k_n}$ or its projection on the $k_n$ last coordinates
$(\theta(i))_{1 \le i \le k_n}$; in the same way, if $h$ denotes a vector
$(h(i))_{0 \le i \le k_n}$ in $\mathbb{R}^{k_n+1}$ such that
$\sum_{i=0}^{k_n} h(i) = 0$, $h$ may also denote its projection on the $k_n$
last coordinates $(h(i))_{1 \le i \le k_n}$, depending on the context.

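For a given sample $\mathbf{x}$, the score function is the gradient of the
log-likelihood at $\theta \in \Theta_k$: for $i \in \{1, \ldots, k\}$,
$(\dot\ell_n(\theta))_i = N_i/\theta(i) - N_0/\theta(0)$. Assume all components
of $\theta \in \Theta_k$ are positive, then the information matrix $I(\theta)$
is defined as

$$I(\theta) = \frac{1}{n} \mathbb{E}_\theta\big[ \dot\ell_n(\theta)\, \dot\ell_n^T(\theta) \big] = \mathrm{Diag}\Big( \frac{1}{\theta(i)} \Big)_{1 \le i \le k} + \frac{1}{\theta(0)} \mathbf{1} \mathbf{1}^T$$

and its inverse is

$$I^{-1}(\theta) = \mathrm{Diag}\big( \theta(i) \big)_{1 \le i \le k} - \begin{pmatrix} \theta(1) \\ \vdots \\ \theta(k) \end{pmatrix} \begin{pmatrix} \theta(1) & \ldots & \theta(k) \end{pmatrix} .$$

It can be checked that $\det(I(\theta)) = \prod_{i=0}^{k} \theta(i)^{-1}$. The
pseudo-sufficient statistic $\Delta_n(\theta)$ is defined as

$$\Delta_n(\theta) = \frac{1}{\sqrt{n}} I^{-1}(\theta)\, \dot\ell_n(\theta) = \sqrt{n}\,(\hat\theta - \theta) .$$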
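Note that $\sqrt{n}\,(\hat\theta - \theta_0) = \Delta_n(\theta_0)$ and that this
$k$-dimensional random vector has covariance matrix $I^{-1}(\theta_0)$. Moreover,
for each positive $\theta \in \Theta_k$,
$\Delta_n^T(\theta) I(\theta) \Delta_n(\theta)$ coincides with the Pearson
$\chi^2$ statistic:

$$\Delta_n^T(\theta)\, I(\theta)\, \Delta_n(\theta) = \sum_{i=0}^{k} \frac{(N_i - n\theta(i))^2}{n\theta(i)} .$$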
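
The closed form of the inverse information matrix and the identity with the Pearson statistic can be checked numerically. The sketch below is an illustration only, written with plain Python lists; the helper names and the toy values of the p.m.f. and of the counts are assumptions, not taken from the paper.

```python
import math

def fisher_information(theta):
    """I(theta) as a k x k nested list; theta = [theta(0), ..., theta(k)], all > 0."""
    k = len(theta) - 1
    return [[(1.0 / theta[i] if i == j else 0.0) + 1.0 / theta[0]
             for j in range(1, k + 1)] for i in range(1, k + 1)]

def fisher_inverse(theta):
    """Closed form Diag(theta(i)) - theta theta^T over coordinates 1..k."""
    k = len(theta) - 1
    return [[(theta[i] if i == j else 0.0) - theta[i] * theta[j]
             for j in range(1, k + 1)] for i in range(1, k + 1)]

def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def quad_form(m, v):
    """v^T m v."""
    return sum(v[i] * m[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

def pearson_chi2(count_vec, theta):
    """Sum over i = 0..k of (N_i - n theta(i))^2 / (n theta(i))."""
    n = sum(count_vec)
    return sum((c - n * t) ** 2 / (n * t) for c, t in zip(count_vec, theta))

theta = [0.1, 0.4, 0.3, 0.2]        # hypothetical truncated p.m.f. on {0, 1, 2, 3}
count_vec = [3, 4, 2, 1]            # hypothetical counts N_0, ..., N_3
n = sum(count_vec)
# Delta_n(theta) = sqrt(n) (theta_hat - theta), coordinates 1..k only
delta = [math.sqrt(n) * (count_vec[i] / n - theta[i]) for i in range(1, len(theta))]

info = fisher_information(theta)
identity = matmul(info, fisher_inverse(theta))   # should be the 3 x 3 identity
chi2_quadratic = quad_form(info, delta)          # Delta^T I(theta) Delta
chi2_pearson = pearson_chi2(count_vec, theta)    # Pearson statistic over 0..k
print(identity)
print(chi2_quadratic, chi2_pearson)
```

The quadratic form only involves coordinates $1, \ldots, k$, while the Pearson sum runs over $0, \ldots, k$; the contribution of coordinate 0 is carried by the rank-one term $\mathbf{1}\mathbf{1}^T/\theta(0)$.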



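Let $k_n$ denote a truncation level. If $h$ belongs to $\mathbb{R}^{k_n+1}$ and
satisfies $\sum_{i=0}^{k_n} h(i) = 0$, let $\sigma_n^2(h)$ be defined by

$$\sigma_n^2(h) = \sum_{i=0}^{k_n} \frac{h^2(i)}{\theta_0(i)} ,$$

where we agree on the following convention: if $\theta_0(i) = 0$ and $h(i) = 0$,
then $h^2(i)/\theta_0(i) = 0$. The set $\mathcal{E}_{\theta_0,k_n}(M)$ is the
intersection of a $k_n$-dimensional subspace with an ellipsoid in
$\mathbb{R}^{k_n+1}$:

$$\mathcal{E}_{\theta_0,k_n}(M) = \Big\{ h : \sigma_n^2(h) \le M, \ \sum_{i=0}^{k_n} h(i) = 0, \ h(i) \ge -\sqrt{n}\,\theta_0(i), \ i = 0, \ldots, k_n \Big\} .$$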
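
Membership in the set above amounts to three elementary checks. The following sketch is illustrative only; the function names, the numerical tolerance and the choice of returning an infinite value when $\theta_0(i) = 0$ but $h(i) \neq 0$ (a case the convention above does not cover) are assumptions.

```python
import math

def sigma2(h, theta0):
    """sigma_n^2(h) = sum_i h(i)^2 / theta0(i), with the convention 0/0 = 0."""
    total = 0.0
    for hi, ti in zip(h, theta0):
        if ti == 0.0:
            if hi != 0.0:
                return math.inf   # assumption: direction not allowed by theta0
            continue
        total += hi * hi / ti
    return total

def in_ellipsoid(h, theta0, n, M, tol=1e-12):
    """Check the three constraints defining the set of perturbations of radius M."""
    sums_to_zero = abs(sum(h)) <= tol
    stays_pmf = all(hi >= -math.sqrt(n) * ti for hi, ti in zip(h, theta0))
    return sums_to_zero and stays_pmf and sigma2(h, theta0) <= M

theta0 = [0.1, 0.4, 0.3, 0.2]     # hypothetical truncated p.m.f.
h = [0.1, -0.1, 0.0, 0.0]         # hypothetical perturbation direction
print(sigma2(h, theta0))                    # approximately 0.125
print(in_ellipsoid(h, theta0, 100, 1.0))    # True
print(in_ellipsoid(h, theta0, 100, 0.1))    # False
```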

In the parametric setting, that is when $k_n$ remains fixed, Le Cam's proof of
the Bernstein-Von Mises Theorem (van der Vaart, 1998; van der Vaart, 2002) is
made significantly more transparent by resorting to a contiguity argument. In
order to adapt this argument to our setting, we need to formulate two conditions.
In the sequel $(k_n)_{n \in \mathbb{N}}$ denotes a non-decreasing sequence of
truncation levels.

Condition 2.1. The p.m.f. $\theta_0$ and the sequence $(k_n)_{n \in \mathbb{N}}$ satisfy

$$n \inf_{i \le k_n} \theta_0(i) \to +\infty .$$
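
To see what Condition 2.1 allows, one may evaluate $n \inf_{i \le k_n} \theta_0(i)$ on a simple example. The sketch below is an illustration under stated assumptions: the geometric p.m.f. $\theta_0(i) = 2^{-i}$, the truncation rule $k_n = \lfloor \log_2(n)/2 \rfloor$ and the function names are all hypothetical, and the infimum is taken over $i = 1, \ldots, k_n$.

```python
def n_times_inf(theta0, n, k):
    """n * min over 1 <= i <= k of theta0(i), for a p.m.f. given as a function on N*."""
    return n * min(theta0(i) for i in range(1, k + 1))

def geometric(i):
    """Hypothetical sampling p.m.f. theta0(i) = 2^{-i} on N*."""
    return 2.0 ** (-i)

def k_half_log2(n):
    """Hypothetical truncation rule k_n = floor(log2(n) / 2)."""
    return (n.bit_length() - 1) // 2

for n in (2 ** 10, 2 ** 20, 2 ** 30):
    k = k_half_log2(n)
    print(n, k, n_times_inf(geometric, n, k))
# 1024 5 32.0
# 1048576 10 1024.0
# 1073741824 15 32768.0

# A faster-growing rule k_n = floor(log2(n)) keeps the product bounded:
print(n_times_inf(geometric, 2 ** 20, 20))   # 1.0
```

With the slower rule the product grows like $\sqrt{n}$, whereas with $k_n = \lfloor \log_2 n \rfloor$ it stays equal to 1 along powers of 2, so the condition fails for this p.m.f.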

Let $(h_n)_{n \in \mathbb{N}}$ denote a sequence of elements from
$\mathbb{R}^{k_n+1}$ such that for each $n$, $\sum_{i=0}^{k_n} h_n(i) = 0$. The
sequence $(h_n)_{n \in \mathbb{N}}$ is said to be tangent at the p.m.f.
$\theta_0$ if the following condition is satisfied.

Condition 2.2. There exists a positive real $\sigma$ such that the sequence
$\sigma_n^2(h_n)$ tends toward $\sigma^2 > 0$.

The probability distribution $\mathbb{P}_{n,h}$ over $\{0, \ldots, k_n\}^n$ is
the product distribution defined by the perturbed p.m.f.
$\theta_0(i) + h(i)/\sqrt{n}$ if $0 \le \theta_0(i) + h(i)/\sqrt{n} \le 1$
for all $i$ in $\{0, \ldots, k_n\}$. We are now equipped to state the building
block of the contiguity argument; the proof is given in the appendix (A).

Lemma 2.3. Let $\theta_0$ denote a probability mass function over
$\mathbb{N}_*$. If the sequence of truncation levels $(k_n)_{n \in \mathbb{N}}$
satisfies Condition 2.1 and if the sequence $(h_n)_{n \in \mathbb{N}}$ satisfies
the tangency Condition 2.2, then the sequences $(\mathbb{P}_{n,h_n})_n$ and
$(\mathbb{P}_{n,0})_n$ are mutually contiguous, that is, for any sequence
$(B_n)$ of events where for each $n$, $B_n \subseteq \{0, \ldots, k_n\}^n$, the
following holds:

$$\lim_n \mathbb{P}_{n,h_n}\{B_n\} = 0 \iff \lim_n \mathbb{P}_{n,0}\{B_n\} = 0 .$$
Note that throughout the paper, we use De Finetti's convention: if
$(\Omega, \mathcal{F}, P)$ denotes a probability space and $Z$ a random variable
on $(\Omega, \mathcal{F})$, then $PZ = P[Z] = P(Z)$ denotes the expected value of
$Z$ (provided it is well-defined, that is, $PZ_+$ and $PZ_-$ are not both
infinite). If $A$ denotes an event, then $P\{A\} = P\mathbf{1}_A$.

3. Main results 

In a Bayesian setting, the set of parameters is endowed with a prior distribution.
In this paper, we consider a sequence of prior distributions
$(W_n)_{n \in \mathbb{N}}$ matching the non-decreasing sequence of truncation
levels we use. Let $W_n$ be a prior probability distribution for
$(\theta(i))_{1 \le i \le k_n}$ such that
$\theta = (\theta(i))_{0 \le i \le k_n} \in \Theta_{k_n}$. Henceforth, we assume
that $W_n$ has a density $w_n$ with respect to Lebesgue measure on
$\mathbb{R}^{k_n}$. Let $\tau = (\tau(i))_{0 \le i \le k_n}$ be a random variable
such that $(\tau(i))_{1 \le i \le k_n}$ is distributed according to $W_n$ and
$\tau(0) = 1 - \sum_{i=1}^{k_n} \tau(i)$. Conditionally on $\tau = \theta$,
$(X_n)_{n \in \mathbb{N}}$ is a sequence of independent random variables
distributed according to the p.m.f. $\theta$.

3.1. Non parametric Bernstein-Von Mises Theorem 

Let $H_n$ be the random variable
$H_n = \sqrt{n}\,(\tau(i) - \theta_0(i))_{1 \le i \le k_n}$ and
$P_{H_n \mid X_{1:n}}$ its posterior distribution, that is its distribution
conditionally on the observations $X_{1:n} = (X_1, \ldots, X_n)$. If the
truncation level $k_n = k$ (that is the dimension of the parameter space
$\Theta_{k_n}$) is a constant integer, the classical parametric Bernstein-Von
Mises Theorem asserts that the sequence of posterior distributions is
asymptotically Gaussian with centerings
$\Delta_n(\theta_0) = \sqrt{n}\,(\hat\theta - \theta_0)$ and variance
$I^{-1}(\theta_0)$ if the observations $X_{1:n}$ are independently distributed
according to $\theta_0$.

Theorem 3.7 below asserts that under adequate conditions on the sequence of
priors $W_n$ and on the tail behavior of $\theta_0$, the Bernstein-Von Mises
Theorem still holds provided the truncation levels $k_n$ do not increase too
fast toward infinity.

For any sequence of prior distributions $(W_n)_n$, for a sequence $M_n$ of real
numbers increasing to $+\infty$, and a sequence $(k_n)_n$ of truncation levels
that satisfy Condition 2.1, we will use the following three conditions in order
to establish the three propositions the Bernstein-Von Mises Theorem depends on.

Condition 3.1. The sequence of truncation levels $(k_n)_n$ and radii $M_n$ satisfies

$$M_n = o\Big( \big( n \inf_{i \le k_n} \theta_0(i) \big)^{1/3} \Big) \qquad (3.2)$$

$$k_n = o(M_n) . \qquad (3.3)$$

Requiring a prior smoothness condition is commonplace when establishing 
asymptotic normality of posterior distribution in parametric settings. 

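Condition 3.4. (prior smoothness)

$$\sup_{h, g \in \mathcal{E}_{\theta_0,k_n}(M_n)} \frac{w_n\big(\theta_0 + \frac{h}{\sqrt{n}}\big)}{w_n\big(\theta_0 + \frac{g}{\sqrt{n}}\big)} \to 1 .$$
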
Requiring a prior concentration condition, sometimes called a small ball
probability condition, is usual in non-parametric Bayesian statistics.

Condition 3.5. (prior concentration)

$$k_n \log(n) \vee \log(\det(I(\theta_0))) \vee -\log(w_n(\theta_0)) = o(M_n) ,$$

where $\det(I(\theta_0)) = \prod_{i=0}^{k_n} \theta_0(i)^{-1}$.

Note that the prior concentration condition entails the second condition in
Condition 3.1.

The next lemma, which is proved in Section 4.3, asserts that under mild
conditions, the posterior distribution concentrates on $\chi^2$ (Fisher) balls
centered around maximum likelihood estimates.

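Lemma 3.6. (Posterior concentration) If the p.m.f. $\theta_0$ and the sequence
of truncation levels $(k_n)_n$ both satisfy Conditions (2.1, 3.4, 3.5) and if
$M_n = o\big( n \inf_{i \le k_n} \theta_0(i) \big)$ then under $\mathbb{P}_{n,0}$

$$\mathbb{P}_{n,0} P_{H_n \mid X_{1:n}} \big( H_n^T I(\theta_0) H_n > M_n \big) = \mathbb{P}_{n,0} P_{H_n \mid X_{1:n}} \big( H_n \notin \mathcal{E}_{\theta_0,k_n}(M_n) \big) \to 0 .$$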

This posterior concentration lemma makes it possible to recover the parametric
posterior concentration phenomenon if truncation levels remain fixed, and it
strengthens the generic non-parametric posterior concentration theorem from
Ghosal et al. (2000).

Theorem 3.7. (A non-parametric Bernstein-Von Mises Theorem) If the sequence of
truncation levels $(k_n)_{n \in \mathbb{N}}$, $k_n \to +\infty$, and the p.m.f.
$\theta_0$ over $\mathbb{N}_*$ satisfy Condition 2.1, and if there is an
increasing sequence $(M_n)_n$ tending to infinity such that Conditions 3.1, 3.4
and 3.5 hold, then

$$\mathbb{P}_{n,0} \big\| \mathcal{N}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0)) - P_{H_n \mid X_{1:n}} \big\| \to 0$$

where $\| \cdot \|$ denotes the total variation norm.
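
A rough sanity check of the Gaussian approximation can be made on the first two moments when the prior is a symmetric Dirichlet distribution (the priors considered in Section 3.3), since the posterior is then Dirichlet with closed-form mean and covariance. The sketch below is only an illustration: it compares moments, not total variation distance, and the counts, the prior parameter and the function names are assumptions.

```python
def dirichlet_posterior_moments(count_vec, beta):
    """Mean and covariance (coordinates 1..k) of Dirichlet(beta + N_0, ..., beta + N_k)."""
    a = [beta + c for c in count_vec]
    a0 = sum(a)
    m = [ai / a0 for ai in a]
    k = len(count_vec) - 1
    cov = [[((m[i] if i == j else 0.0) - m[i] * m[j]) / (a0 + 1.0)
            for j in range(1, k + 1)] for i in range(1, k + 1)]
    return m[1:], cov

def fisher_inverse(theta):
    """Diag(theta(i)) - theta theta^T over coordinates 1..k."""
    k = len(theta) - 1
    return [[(theta[i] if i == j else 0.0) - theta[i] * theta[j]
             for j in range(1, k + 1)] for i in range(1, k + 1)]

n = 10 ** 6
theta_hat = [0.1, 0.4, 0.3, 0.2]                 # hypothetical empirical p.m.f.
count_vec = [100000, 400000, 300000, 200000]     # counts such that N_i / n = theta_hat(i)
mean, cov = dirichlet_posterior_moments(count_vec, beta=0.5)
target = fisher_inverse(theta_hat)

# Covariance of sqrt(n) * tau under the posterior versus I^{-1}(theta_hat)
max_cov_gap = max(abs(n * cov[i][j] - target[i][j])
                  for i in range(3) for j in range(3))
# Posterior mean of tau versus the maximum likelihood estimate
max_mean_gap = max(abs(mean[i] - theta_hat[i + 1]) for i in range(3))
print(max_cov_gap, max_mean_gap)
```

Both gaps are of order $1/n$ here, which is negligible after rescaling by $\sqrt{n}$; the theorem itself makes the much stronger statement of convergence in total variation with growing dimension.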

A comparison of the Theorem with respect to previous results available in 
the literature (Ghosal and van der Vaart, 2007b, a; Ghosal et al., 2000; Ghosal, 
2000) is given at the end of the Section. 

Remark 3.8. A corollary of the Bernstein-Von Mises Theorem is that

$$\mathbb{P}_{n,0} \big( P_{H_n \mid X_{1:n}} \{ H_n^T I(\theta_0) H_n > u_n \} \big) \to 0$$

if and only if $u_n / k_n \to \infty$.

The proof of Theorem 3.7 is organized along the lines of Le Cam's proof of 
the parametric Bernstein-Von Mises Theorem as exposed by A. van der Vaart 
in (van der Vaart, 1998) (see also van der Vaart (2002)). 
Roadmap of the proof of the Bernstein-Von Mises theorem. If $P$ is any
probability distribution on $\mathbb{R}^{k_n}$ and $M > 0$ is any positive real,
let $P^M$ be the conditional probability distribution on the ellipsoid
$\{u \in \mathbb{R}^{k_n} : u^T I(\theta_0) u = \sigma_n^2(u) \le M\}$. For any
measurable set $B$,

$$P^M(B) = \frac{P\{ B \cap \{u : u^T I(\theta_0) u \le M\} \}}{P\{ u : u^T I(\theta_0) u \le M \}} .$$

To alleviate notation, we will use the shorthands $\mathcal{N}_{k_n}$ and
$\mathcal{N}_{k_n}^{M_n}$ to denote the (random) distributions
$\mathcal{N}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0))$ and
$\mathcal{N}_{k_n}^{M_n}(\Delta_n(\theta_0), I^{-1}(\theta_0))$.
From the triangle inequality, it follows that:

$$\big\| \mathcal{N}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0)) - P_{H_n \mid X_{1:n}} \big\| \le \big\| \mathcal{N}_{k_n} - \mathcal{N}_{k_n}^{M_n} \big\| + \big\| \mathcal{N}_{k_n}^{M_n} - P^{M_n}_{H_n \mid X_{1:n}} \big\| + \big\| P^{M_n}_{H_n \mid X_{1:n}} - P_{H_n \mid X_{1:n}} \big\| .$$

The proof of Theorem 3.7 boils down to checking that each of the three terms
on the right-hand side tends to 0 in $\mathbb{P}_{n,0}$ probability.

The first term turns out to be the easiest to control thanks to the well-known
concentration properties of the Gaussian distribution. Upper bounding the middle
term is arguably the most delicate part of the proof. The posterior concentration
Lemma makes it possible to deal with the third term.

Let us call $\mathrm{NV}(M_n)$ the middle term:

$$\mathrm{NV}(M_n) = \big\| \mathcal{N}_{k_n}^{M_n} - P^{M_n}_{H_n \mid X_{1:n}} \big\| .$$

The posterior density is proportional to the product of the prior density and of
the likelihood function. Hence, controlling the variation distance between
$\mathcal{N}_{k_n}^{M_n}$ and $P^{M_n}_{H_n \mid X_{1:n}}$ requires a good
understanding of log-likelihood ratios. A quadratic Taylor expansion of the
log-likelihood ratio leads to:



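$$\begin{aligned}
\log \frac{d\mathbb{P}_{n,h}}{d\mathbb{P}_{n,0}}(\mathbf{x})
&= \sum_{i=0}^{k_n} N_i \log\Big( 1 + \frac{h(i)}{\sqrt{n}\,\theta_0(i)} \Big) \\
&= Z_n(h) - \frac{1}{2n} \sum_{i=0}^{k_n} N_i \frac{h^2(i)}{\theta_0(i)^2} + \frac{1}{n} \sum_{i=0}^{k_n} N_i \frac{h^2(i)}{\theta_0(i)^2} R\Big( \frac{h(i)}{\sqrt{n}\,\theta_0(i)} \Big) \\
&= Z_n(h) - \frac{\sigma_n^2(h)}{2} + \frac{A_n(h)}{2} + C_n(h)
\end{aligned}$$

where $R$, defined by $\log(1+u) = u - u^2/2 + u^2 R(u)$, satisfies
$R(u) = O(u)$ as $u$ tends toward 0, and

$$Z_n(h) = \frac{1}{\sqrt{n}} \sum_{i=0}^{k_n} N_i \frac{h(i)}{\theta_0(i)} ,$$

$$A_n(h) = \sigma_n^2(h) - \frac{1}{n} \sum_{i=0}^{k_n} N_i \frac{h^2(i)}{\theta_0(i)^2} ,$$

and

$$C_n(h) = \frac{1}{n} \sum_{i=0}^{k_n} N_i \frac{h^2(i)}{\theta_0(i)^2} R\Big( \frac{h(i)}{\sqrt{n}\,\theta_0(i)} \Big) .$$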
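
The decomposition of the log-likelihood ratio can be checked numerically. In the sketch below, which is an illustration only, the quantities are normalised as $Z_n(h) = n^{-1/2} \sum_i N_i h(i)/\theta_0(i)$, $A_n(h) = \sigma_n^2(h) - n^{-1} \sum_i N_i h^2(i)/\theta_0(i)^2$ and $C_n(h) = n^{-1} \sum_i N_i (h(i)/\theta_0(i))^2 R(h(i)/(\sqrt{n}\theta_0(i)))$, with $\log(1+u) = u - u^2/2 + u^2 R(u)$; these normalisations, the function names and the toy data are assumptions chosen so that the identity $\log$-ratio $= Z_n - \sigma_n^2/2 + A_n/2 + C_n$ holds exactly.

```python
import math

def R(u):
    """Remainder in log(1 + u) = u - u^2/2 + u^2 R(u); R(u) = O(u) near 0."""
    return 0.0 if u == 0.0 else (math.log1p(u) - u + 0.5 * u * u) / (u * u)

def log_likelihood_ratio(count_vec, theta0, h, n):
    """Sum over i of N_i log(1 + h(i) / (sqrt(n) theta0(i)))."""
    return sum(c * math.log1p(hi / (math.sqrt(n) * t))
               for c, t, hi in zip(count_vec, theta0, h))

def decomposition(count_vec, theta0, h, n):
    """Return (Z_n(h), sigma_n^2(h), A_n(h), C_n(h)) under the stated normalisations."""
    z = sum(c * hi / t for c, t, hi in zip(count_vec, theta0, h)) / math.sqrt(n)
    s2 = sum(hi * hi / t for t, hi in zip(theta0, h))
    a = s2 - sum(c * hi * hi / (t * t) for c, t, hi in zip(count_vec, theta0, h)) / n
    c_n = sum(c * (hi / t) ** 2 * R(hi / (math.sqrt(n) * t))
              for c, t, hi in zip(count_vec, theta0, h)) / n
    return z, s2, a, c_n

theta0 = [0.1, 0.4, 0.3, 0.2]          # hypothetical truncated p.m.f.
count_vec = [90, 410, 310, 190]        # hypothetical counts, n = 1000
n = sum(count_vec)
h = [0.2, -0.3, 0.05, 0.05]            # hypothetical direction summing to 0

lhs = log_likelihood_ratio(count_vec, theta0, h, n)
z, s2, a, c_n = decomposition(count_vec, theta0, h, n)
rhs = z - s2 / 2 + a / 2 + c_n
print(lhs, rhs)
print(a, c_n)   # both corrections are small compared with sigma_n^2(h)
```

The terms $A_n(h)$ and $C_n(h)$ are the two corrections that have to be controlled uniformly over the ellipsoid: the first measures the fluctuation of the empirical weights around $\theta_0$, the second the Taylor remainder.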



Performing algebra along the lines described in (van der Vaart, 2002, p. 142)
(computational details are given in the Appendix, see Section C), leads to



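$$\mathrm{NV}(M_n) \le 2 \iint \Big( 1 - \frac{w_n(\theta_0 + g/\sqrt{n})}{w_n(\theta_0 + h/\sqrt{n})} \, \exp\Big( \frac{A_n(g) - A_n(h)}{2} + C_n(g) - C_n(h) \Big) \Big)_+ \, d\mathcal{N}_{k_n}^{M_n}(g) \, dP^{M_n}_{H_n \mid X_{1:n}}(h) . \qquad (3.9)$$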



We prove in Section 4.1 that the decay of $\mathrm{NV}(M_n)$ depends on prior
smoothness around $\theta_0$ and on the ratio between $M_n$ and
$(n \inf_{i \le k_n} \theta_0(i))^{1/3}$:



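Proposition 3.10.

$$\mathbb{P}_{n,0} \big( \mathrm{NV}(M_n) \big) = O\Big( \frac{M_n^{3/2}}{\sqrt{n \inf_{i \le k_n} \theta_0(i)}} + 1 - \inf_{h, g \in \mathcal{E}_{\theta_0,k_n}(M_n)} \frac{w_n(\theta_0 + g/\sqrt{n})}{w_n(\theta_0 + h/\sqrt{n})} \Big) .$$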



If the sequence of truncation levels $(k_n)_n$ and radii $(M_n)_n$ satisfies
Conditions (3.1) and (3.4) then

$$\mathbb{P}_{n,0} \big( \mathrm{NV}(M_n) \big) = o(1) .$$



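The third term $\| P^{M_n}_{H_n \mid X_{1:n}} - P_{H_n \mid X_{1:n}} \|$ is
handled thanks to the posterior concentration lemma, since by Lemma B.1 in the
appendix

$$\big\| P^{M_n}_{H_n \mid X_{1:n}} - P_{H_n \mid X_{1:n}} \big\| = 2 P_{H_n \mid X_{1:n}} \big( H_n^T I(\theta_0) H_n > M_n \big) .$$

The proof of the Theorem is concluded by upper-bounding
$\| \mathcal{N}_{k_n} - \mathcal{N}_{k_n}^{M_n} \|$. The latter quantity is a
matter of concern because we are facing increasing dimensions
$(k_n)_{n \in \mathbb{N}}$. It is checked in Section 4.4 that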

Proposition 3.11. There exists a universal constant $C$ such that if
$\liminf_n \big( n \inf_{i \le k_n} \theta_0(i) \big) \ge c_0 > 0$ and
$\liminf_n M_n / k_n > 64$, then for large enough $n$

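$$\mathbb{P}_{n,0} \big\| \mathcal{N}_{k_n} - \mathcal{N}_{k_n}^{M_n} \big\| \le C \exp\Big( - \frac{M_n \wedge \sqrt{c_0 M_n}}{C} \Big) .$$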

3.2. Estimating functionals 

The Bernstein-von Mises Theorem provides a handy tool to check the asymptotic
normality of estimators of Rényi and Shannon entropies. Antos and Kontoyiannis
(2001) established that plug-in estimators of Shannon and Rényi entropies are
consistent whatever the sampling probability is. They also proved that entropy
estimation may be arbitrarily slow, and that on a large class of sampling
distributions, the mean squared error is $O(\log n / n)$. In the parametric
setting, that is with fixed finite alphabets, analogues of the delta-method and
the classical Bernstein-Von Mises Theorem can be used to check the asymptotic
normality of both frequentist and Bayesian entropy estimators. Our purpose is to
show that the non-parametric Bernstein-Von Mises Theorem can be used as well.

For any $\alpha > 0$, let $g_\alpha$ be the real function defined for
non-negative real numbers by $g_\alpha(u) = u^\alpha$ for $\alpha \neq 1$, and
$g_1(u) = u \log u$ (with the convention $g_1(0) = 0$). The additive functional
$G_\alpha$ is defined by

$$G_\alpha(\theta) = \sum_{i=1}^{+\infty} g_\alpha(\theta(i)) .$$

The Shannon entropy of the probability mass function $\theta$ is
$-G_1(\theta)$ and for $\alpha \neq 1$,
$\frac{1}{1-\alpha} \log G_\alpha(\theta)$ denotes the Rényi entropy of order
$\alpha$ (Cover and Thomas, 1991).
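
For a finitely supported p.m.f., both entropies are straightforward to compute from the additive functional. The following lines are a small illustration; the function names and the example p.m.f. are hypothetical, and natural logarithms are used throughout.

```python
import math

def g(alpha, u):
    """g_alpha(u) = u^alpha for alpha != 1, u log u for alpha = 1 (with g(0) = 0)."""
    if u == 0.0:
        return 0.0
    return u * math.log(u) if alpha == 1 else u ** alpha

def G(alpha, theta):
    """Additive functional G_alpha(theta) = sum_i g_alpha(theta(i))."""
    return sum(g(alpha, t) for t in theta)

def shannon_entropy(theta):
    """Shannon entropy -G_1(theta), in nats."""
    return -G(1, theta)

def renyi_entropy(alpha, theta):
    """Renyi entropy of order alpha != 1: log G_alpha(theta) / (1 - alpha)."""
    return math.log(G(alpha, theta)) / (1.0 - alpha)

theta = [0.5, 0.25, 0.25]               # hypothetical p.m.f. on {1, 2, 3}
print(shannon_entropy(theta))           # equals 1.5 * log(2)
print(renyi_entropy(2, theta))          # equals -log(0.375)
print(renyi_entropy(1.0001, theta))     # close to the Shannon entropy
```

As the order $\alpha$ tends to 1, the Rényi entropy approaches the Shannon entropy, which the last line illustrates numerically.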

Let $\tau = (\tau(i))_{0 \le i \le k_n}$ be distributed according to the
posterior distribution; a Bayesian estimator of $G_\alpha(\theta)$ may be
constructed using the posterior distribution of

$$G_{n,\alpha}(\tau) = \sum_{i=1}^{k_n} g_\alpha(\tau(i)) .$$

The Bernstein-Von Mises Theorem asserts that under $\mathbb{P}_{n,0}$, for large
enough $n$, the posterior distribution of $(\tau(i))_{1 \le i \le k_n}$ is
approximately Gaussian, centered around the maximum likelihood estimator
$\hat\theta_n = (\hat\theta(i))_{1 \le i \le k_n}$, with variance
$\frac{1}{n} I(\theta_0)^{-1}$. Theorem 3.12 below makes a similar assertion
concerning $G_{n,\alpha}(\tau)$. Let $G_{n,\alpha}(\hat\theta_n)$ be the
truncated plug-in maximum likelihood estimator:

$$G_{n,\alpha}(\hat\theta_n) = \sum_{i=1}^{k_n} g_\alpha(\hat\theta_n(i)) .$$

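The variance parameter $\gamma_{n,\alpha}^2$ is defined by

$$\gamma_{n,\alpha}^2 = \sum_{i=1}^{k_n} \theta_0(i) \big( g_\alpha'(\theta_0(i)) \big)^2 - \Big( \sum_{i=1}^{k_n} \theta_0(i)\, g_\alpha'(\theta_0(i)) \Big)^2 .$$

Notice that $g_1'(u) = \log u + 1$ and $g_\alpha'(u) = \alpha u^{\alpha-1}$ for
$\alpha \neq 1$, so that $\gamma_{n,1}^2$ has limit
$\gamma_1^2 = \sum_{i=1}^{\infty} \theta_0(i) (\log \theta_0(i) + 1)^2 - \big( \sum_{i=1}^{\infty} \theta_0(i) (\log \theta_0(i) + 1) \big)^2$
as soon as this is finite, and $\gamma_{n,\alpha}^2$ has limit
$\gamma_\alpha^2 = \alpha^2 \big[ \sum_{i=1}^{\infty} \theta_0(i)^{2\alpha-1} - \big( \sum_{i=1}^{\infty} \theta_0(i)^{\alpha} \big)^2 \big] = \alpha^2 \big[ G_{2\alpha-1}(\theta_0) - (G_\alpha(\theta_0))^2 \big]$
as soon as this is finite, which requires at least that $\alpha > 1/2$.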
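
The variance parameter and its limits can be evaluated on a simple example. The sketch below assumes the geometric p.m.f. $\theta_0(i) = 2^{-i}$ (a hypothetical choice, as are the function names), for which the limits are available in closed form: $\gamma_2^2 = 4(1/7 - 1/9) = 8/63$ and $\gamma_1^2 = 2 (\log 2)^2$.

```python
import math

def g_prime(alpha, u):
    """Derivative of g_alpha: log(u) + 1 for alpha = 1, alpha * u^(alpha - 1) otherwise."""
    return math.log(u) + 1.0 if alpha == 1 else alpha * u ** (alpha - 1)

def gamma2(alpha, theta0, k):
    """Variance parameter over coordinates 1..k; theta0 is a function on N*."""
    probs = [theta0(i) for i in range(1, k + 1)]
    first = sum(p * g_prime(alpha, p) ** 2 for p in probs)
    second = sum(p * g_prime(alpha, p) for p in probs) ** 2
    return first - second

def geometric(i):
    """Hypothetical sampling p.m.f. theta0(i) = 2^{-i} on N*."""
    return 2.0 ** (-i)

print(gamma2(2, geometric, 60), 8.0 / 63.0)
print(gamma2(1, geometric, 60), 2.0 * math.log(2) ** 2)
# For alpha = 1/2 the first sum equals k/4, so the variance parameter diverges with k:
print(gamma2(0.5, geometric, 20), gamma2(0.5, geometric, 40))
```

The last line illustrates why a finite limit requires $\alpha > 1/2$: for $\alpha = 1/2$ the sum $\sum_i \theta_0(i)^{2\alpha - 1}$ counts the support and grows linearly in the truncation level.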
Now, let $\mathcal{I}$ be the collection of all intervals in $\mathbb{R}$, and
for any $I \in \mathcal{I}$, let $\Phi(I) = \int_I \phi(x)\,dx$ where $\phi$ is
the density of $\mathcal{N}(0,1)$. The following Theorem asserts that the
Lévy-Prokhorov distance between the posterior distribution of
$\sqrt{n}\,(G_{n,\alpha}(\tau) - G_{n,\alpha}(\hat\theta_n))$ and
$\mathcal{N}(0, \gamma_\alpha^2)$ tends to 0 in $\mathbb{P}_{n,0}$ probability.
The Lévy-Prokhorov distance metrizes convergence in distribution.

Theorem 3.12. (Estimating functionals) If
$\lim_n \gamma_{n,\alpha} = \gamma_\alpha$ is finite, then under the assumptions
of the Bernstein-von Mises Theorem (Theorem 3.7),

$$\sup_{I \in \mathcal{I}} \Big| P_{H_n \mid X_{1:n}} \Big( \frac{\sqrt{n}\,\big( G_{n,\alpha}(\tau) - G_{n,\alpha}(\hat\theta_n) \big)}{\gamma_{n,\alpha}} \in I \Big) - \Phi(I) \Big| \to 0$$

in $\mathbb{P}_{n,0}$ probability.

The proof of this theorem is given in Section 5.

Let us define the symmetric Bayesian credible set with would-be coverage
probability $1 - \delta$ as the smallest interval which has posterior probability
larger than $1 - \delta$. This credible set is an empirical interval since it is
defined thanks to an empirical quantity, the posterior distribution. In order to
construct such a region, it is enough to sample from the posterior distribution
using MCMC sampling methods. Note that this symmetric Bayesian credible set is
not the (non fully empirical) interval

$$\Big[ G_{n,\alpha}(\hat\theta_n) - \frac{u_\delta \gamma_{n,\alpha}}{\sqrt{n}} , \ G_{n,\alpha}(\hat\theta_n) + \frac{u_\delta \gamma_{n,\alpha}}{\sqrt{n}} \Big]$$

where $u_\delta$ is the $1 - \delta/2$ quantile of $\mathcal{N}(0,1)$. Theorem
3.12 just asserts that asymptotically, the symmetric Bayesian credible set has
length $2 u_\delta \gamma_{n,\alpha}/\sqrt{n}$ and is centered around
$G_{n,\alpha}(\hat\theta_n)$. Hence Theorem 3.12 asserts that, in
$\mathbb{P}_{n,0}$-probability, Bayesian credible sets for $G_\alpha(\theta_0)$
and frequentist confidence intervals based on truncated plug-in maximum
likelihood estimators are asymptotically equivalent.
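
The comparison between the two intervals can be illustrated by simulation. The sketch below rests on several assumptions that are not part of the statement above: the prior is a symmetric Dirichlet distribution, so that the posterior can be sampled directly through normalised Gamma variables instead of MCMC; an equal-tailed posterior interval is used as a stand-in for the smallest interval; the counts are chosen so that the empirical p.m.f. equals the hypothetical $\theta_0$; all function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def g(alpha, u):
    if u == 0.0:
        return 0.0
    return u * math.log(u) if alpha == 1 else u ** alpha

def g_prime(alpha, u):
    return math.log(u) + 1.0 if alpha == 1 else alpha * u ** (alpha - 1)

def G_trunc(alpha, theta):
    """Truncated functional: sum of g_alpha over coordinates 1..k (coordinate 0 excluded)."""
    return sum(g(alpha, t) for t in theta[1:])

def gamma_n(alpha, theta):
    """Square root of the variance parameter, evaluated at the p.m.f. theta."""
    first = sum(t * g_prime(alpha, t) ** 2 for t in theta[1:])
    second = sum(t * g_prime(alpha, t) for t in theta[1:]) ** 2
    return math.sqrt(first - second)

def dirichlet_draw(params, rng):
    """One Dirichlet draw obtained by normalising independent Gamma variables."""
    w = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(w)
    return [x / s for x in w]

def credible_interval(count_vec, beta, alpha, delta, draws, rng):
    """Equal-tailed posterior interval for the truncated functional."""
    params = [beta + c for c in count_vec]
    values = sorted(G_trunc(alpha, dirichlet_draw(params, rng)) for _ in range(draws))
    lower = values[int(math.floor(delta / 2 * draws))]
    upper = values[int(math.ceil((1 - delta / 2) * draws)) - 1]
    return lower, upper

def plug_in_interval(theta_hat, n, alpha, delta):
    """Gaussian interval centered at the truncated plug-in estimate."""
    u = NormalDist().inv_cdf(1 - delta / 2)
    center = G_trunc(alpha, theta_hat)
    half = u * gamma_n(alpha, theta_hat) / math.sqrt(n)
    return center - half, center + half

rng = random.Random(0)
n = 20000
theta_hat = [0.1, 0.4, 0.3, 0.2]            # hypothetical p.m.f. on {0, 1, 2, 3}
count_vec = [2000, 8000, 6000, 4000]        # counts with N_i / n = theta_hat(i)
bayes = credible_interval(count_vec, beta=1.0, alpha=1, delta=0.05, draws=4000, rng=rng)
freq = plug_in_interval(theta_hat, n, alpha=1, delta=0.05)
sd = gamma_n(1, theta_hat) / math.sqrt(n)
print(bayes, freq, sd)
```

With these settings the endpoints of the two intervals differ by a small fraction of the posterior standard deviation, in line with the asymptotic equivalence stated above; the agreement would be looser for small $n$ or for truncation levels that grow quickly.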
The next theorem provides sufficient conditions for the plug-in truncated
maximum likelihood estimators to satisfy a central limit theorem with limiting
variance $\gamma_\alpha^2$.

Theorem 3.13. (MLE functional estimation) Assume that
$\lim_n \gamma_{n,\alpha} = \gamma_\alpha$ is finite. If the truncation parameter
$k_n$ satisfies:

{nmii<k^Oo{i)y^''^ kn = o(l) 
^0o{^)\g'A0o{^))\' - o {V^ 

i=\ 

then $\sqrt{n}\,(G_{n,\alpha}(\hat\theta_n) - G_\alpha(\theta_0))$ converges in distribution to $\mathcal{N}(0, \gamma_\alpha^2)$.
3.3. Dirichlet prior distributions

We may now check that when using Dirichlet distributions as prior distributions,
there exist truncation levels $(k_n)_n$ and radii $(M_n)_n$ such that Conditions
3.4 (prior smoothness) and 3.5 (prior concentration) hold.

Let $\beta = (\beta_0, \beta_1, \ldots, \beta_{k_n})$ be a $(k_n+1)$-tuple of
positive real numbers. The Dirichlet distribution with parameter
$(\beta_0, \beta_1, \ldots, \beta_{k_n})$ on the probability mass functions on
$\{0, 1, \ldots, k_n\}$ has density

$$w_{n,\beta}(\theta) = \frac{\Gamma\big( \sum_{i=0}^{k_n} \beta_i \big)}{\prod_{i=0}^{k_n} \Gamma(\beta_i)} \prod_{i=0}^{k_n} \theta(i)^{\beta_i - 1} .$$

In the absence of prior knowledge concerning the sampling distribution
$\theta_0$, we refrain from assigning different masses on the coordinate
components: we consider Dirichlet priors $W_{n,\beta}$ with constant parameter
$\beta = (\beta, \ldots, \beta)$ for some positive $\beta$.

Note that for $\beta = 1$ (the so-called Laplace prior), the Prior Smoothness
Condition (3.4) trivially holds.

Proposition 3.14. Let the sequence of prior distributions consist of the
Dirichlet priors with parameter $\beta > 0$. The non-parametric Bernstein-Von
Mises Theorem (3.7) holds if the sequence of truncation levels
$(k_n)_{n \in \mathbb{N}}$, $k_n \to +\infty$, and the p.m.f. $\theta_0$ over
$\mathbb{N}_*$ satisfy Condition 2.1, and if

$$k_n \log n \vee \log(\det(I(\theta_0))) = o\Big( \big( n \inf_{i \le k_n} \theta_0(i) \big)^{1/3} \Big) .$$



Using such Dirichlet priors, checking the conditions of Theorem 3.7 boils 
down to checking the Prior Smoothness Condition. 

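Proof. For any $h$ and $g$ in $\mathcal{E}_{\theta_0,k_n}(M_n)$, for large enough $n$,

$$\begin{aligned}
\sup_{h,g\in\mathcal{E}_{\theta_0,k_n}(M_n)} \frac{w_{n,\beta}(\theta_0 + h/\sqrt{n})}{w_{n,\beta}(\theta_0 + g/\sqrt{n})}
&\le \sup_{h,g\in\mathcal{E}_{\theta_0,k_n}(M_n)} \prod_{i=0}^{k_n}\left(\frac{\theta_0(i) + h(i)/\sqrt{n}}{\theta_0(i) + g(i)/\sqrt{n}}\right)^{\beta-1} \\
&\le \sup_{h,g\in\mathcal{E}_{\theta_0,k_n}(M_n)} \exp\left(|\beta-1|\sum_{i=0}^{k_n}\log\frac{1 + h(i)/(\sqrt{n}\theta_0(i))}{1 + g(i)/(\sqrt{n}\theta_0(i))}\right) \\
&\le \sup_{h,g\in\mathcal{E}_{\theta_0,k_n}(M_n)} \exp\left(|\beta-1|\sum_{i=0}^{k_n}\left(\frac{|h(i)|}{\sqrt{n}\theta_0(i)} - \log\Big(1 - \frac{|g(i)|}{\sqrt{n}\theta_0(i)}\Big)\right)\right) \\
&\le \exp\left(3\sqrt{\frac{M_n (k_n+1)(\beta-1)^2}{n\inf_{i\le k_n}\theta_0(i)}}\right)
\end{aligned}$$

as for each $g \in \mathcal{E}_{\theta_0,k_n}(M_n)$, $|g(i)|/(\sqrt{n}\theta_0(i)) \le \sqrt{M_n/(n\inf_{i\le k_n}\theta_0(i))}$ so that
as soon as Condition 3.1 holds, for large enough $n$, $|g(i)|/(\sqrt{n}\theta_0(i)) < 1/2$ and
$\log(1 - |g(i)|/(\sqrt{n}\theta_0(i))) \ge -2|g(i)|/(\sqrt{n}\theta_0(i))$ (the last step of the display also uses
the Cauchy-Schwarz inequality). Thus, the Prior Smoothness Condition holds as soon as

$$\frac{M_n k_n}{n\inf_{i\le k_n}\theta_0(i)} \to 0,$$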
which is a consequence of Condition 3.1. On the other hand

$$-\log w_n(\theta_0) = O\big(\log(\det(I(\theta_0))) + k_n\log(k_n)\big).$$

Thus, using the Dirichlet prior with parameter $\beta$, the Prior Smoothness and
Prior Concentration Conditions hold for $\theta_0$ with truncation levels $k_n$ as soon as
Condition 3.1 and

$$k_n\log n + \log(\det(I(\theta_0))) = o(M_n)$$

hold. But the existence of a sequence of radii $(M_n)$ tending to infinity such that both
the last condition and Condition 3.1 hold is a straightforward consequence of
Condition 2.1 and of the condition in Proposition 3.14. □
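
As a numerical sanity check of the bound used in this proof, the following sketch draws pairs $h, g$ on the boundary of a Fisher ellipsoid and compares the log-ratio of symmetric Dirichlet densities with $3\sqrt{M_n(k_n+1)(\beta-1)^2/(n\inf_i\theta_0(i))}$. The p.m.f. `theta0`, the values of `n`, `beta` and `radius_sq` (standing for $M_n$), and all function names are illustrative assumptions, not quantities taken from the paper.

```python
import math
import random

def dirichlet_log_ratio(theta0, h, g, n, beta):
    """log of w(theta0 + h/sqrt(n)) / w(theta0 + g/sqrt(n)) for a symmetric Dirichlet(beta) density."""
    rn = math.sqrt(n)
    return (beta - 1.0) * sum(
        math.log(t + hi / rn) - math.log(t + gi / rn)
        for t, hi, gi in zip(theta0, h, g)
    )

def random_direction(theta0, radius_sq, rng):
    """Random h with sum(h) = 0 and sum(h_i^2 / theta0_i) = radius_sq."""
    z = [rng.gauss(0.0, 1.0) for _ in theta0]
    mean = sum(z) / len(z)
    z = [zi - mean for zi in z]
    fisher = sum(zi * zi / t for zi, t in zip(z, theta0))
    scale = math.sqrt(radius_sq / fisher)
    return [scale * zi for zi in z]

def smoothness_bound(theta0, n, beta, radius_sq):
    """3 * sqrt(M (k+1) (beta-1)^2 / (n min theta0)), with k+1 = len(theta0)."""
    return 3.0 * math.sqrt(radius_sq * len(theta0) * (beta - 1.0) ** 2 / (n * min(theta0)))

rng = random.Random(0)
theta0 = [0.4, 0.3, 0.2, 0.1]          # illustrative p.m.f. on {0, 1, 2, 3}
n, beta, radius_sq = 10_000, 2.5, 9.0  # illustrative sample size, prior parameter, radius M_n
bound = smoothness_bound(theta0, n, beta, radius_sq)
worst = max(
    abs(dirichlet_log_ratio(theta0,
                            random_direction(theta0, radius_sq, rng),
                            random_direction(theta0, radius_sq, rng),
                            n, beta))
    for _ in range(1000)
)
print(worst, bound)
```

With these values $M_n/(n\inf_i\theta_0(i)) = 0.009$, so the requirement $|g(i)|/(\sqrt{n}\theta_0(i)) < 1/2$ is met and the observed log-ratios stay below the bound.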

Note that if the prior distribution is Dirichlet with parameter $\boldsymbol\beta$ then the
posterior distribution is Dirichlet with parameters $\boldsymbol\beta + (N_0, N_1, \ldots, N_{k_n})$. Let
$n_i = \sum_{j<i} N_j$ for $1 \le i \le k_n+1$, agreeing on $n_0 = 0$ (so that $n_{k_n+1} = n$). Sampling from the posterior
distribution is equivalent to picking an independent sample of $n$ exponentially
distributed (unit-rate) random variables $Y_1, \ldots, Y_n$, picking another independent sample
$Z_0, \ldots, Z_{k_n}$ of $k_n+1$ independent $\Gamma(\beta, 1)$-distributed random variables, and letting

$$\theta^*(i) = \frac{Z_i + \sum_{n_i < j \le n_{i+1}} Y_j}{\sum_{j=1}^{n} Y_j + \sum_{j=0}^{k_n} Z_j}.$$

The latter procedure is very close to the Bayesian Bootstrap (Rubin, 1981): indeed, we obtain the
Bayesian Bootstrap if we omit the $Z_i$ from the weights. This procedure, which
has been extensively investigated (See Lo, 1988, 1987; Weng, 1989, among other
references), is now considered as a special case of exchangeable bootstrap (See
van der Vaart and Wellner, 1996, and references therein). Theorems from the
preceding section tell us that the Bayesian bootstrap of (non-linear) functionals
of the sampling distribution approximates the asymptotic distribution of maximum
likelihood estimates. We leave the analysis of the second-order properties
of the posterior distribution to further investigations.
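
The sampling scheme described above can be sketched with the Python standard library as follows. This is only an illustration: the counts, the value of $\beta$ and the function names are assumptions made for the example, and the check at the end uses the standard formula $(\beta + N_i)/((k_n+1)\beta + n)$ for the mean of a Dirichlet distribution.

```python
import random

def sample_posterior(counts, beta, rng):
    """One draw from Dirichlet(beta + counts): Gamma(beta, 1) weight plus N_i unit exponentials per cell."""
    weights = []
    for n_i in counts:
        z = rng.gammavariate(beta, 1.0)                      # Z_i ~ Gamma(beta, 1)
        y = sum(rng.expovariate(1.0) for _ in range(n_i))    # block of N_i exponentials
        weights.append(z + y)
    total = sum(weights)
    return [w / total for w in weights]

def sample_bayesian_bootstrap(counts, rng):
    """Same construction with the Z_i omitted (Bayesian bootstrap weights)."""
    weights = [sum(rng.expovariate(1.0) for _ in range(n_i)) for n_i in counts]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
counts = [5, 12, 30, 3]   # illustrative counts (N_0, ..., N_{k_n})
beta = 1.0                # Laplace prior
draws = [sample_posterior(counts, beta, rng) for _ in range(4000)]
post_mean = [sum(d[i] for d in draws) / len(draws) for i in range(len(counts))]
exact_mean = [(beta + c) / (len(counts) * beta + sum(counts)) for c in counts]
bootstrap_draw = sample_bayesian_bootstrap(counts, rng)
print(post_mean, exact_mean)
```

The two constructions differ only through the $k_n+1$ extra Gamma variables, whose contribution becomes negligible as the counts grow.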

3.4. Examples

Previous results may now be applied to two examples of envelope classes already
investigated by Boucheron et al. (2009):

1. The sampling probability $\theta_0$ is said to have exponential($\eta$) decay if there
exists $\eta > 0$, and a positive constant $C$ such that

$$\forall i \in \mathbb{N}^*,\quad \frac{1}{C}\exp(-\eta i) \le \theta_0(i) \le C\exp(-\eta i).$$

Using truncation level $k_n$,

$$\frac{\exp(-\eta)}{C(1-\exp(-\eta))}\exp(-\eta k_n) \le \theta_0(0) \le \frac{C\exp(-\eta)}{1-\exp(-\eta)}\exp(-\eta k_n).$$

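2. The sampling probability $\theta_0$ is said to have polynomial($\eta$) decay if there
exists $\eta > 1$, and a positive constant $C$ such that

$$\forall i \in \mathbb{N}^*,\quad \frac{1}{C i^{\eta}} \le \theta_0(i) \le \frac{C}{i^{\eta}}.$$

Using truncation level $k_n$,

$$\frac{(k_n+1)^{1-\eta}}{C(\eta-1)} \le \theta_0(0) \le \frac{C k_n^{1-\eta}}{\eta-1}.$$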
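
For polynomial decay, the two-sided control of the truncated mass can be checked numerically. The sketch below assumes that $\theta_0(0)$ stands for the mass beyond the truncation level, takes the illustrative choice $\theta_0(i) = i^{-2}/\zeta(2)$ (so that $\eta = 2$ and $C = \zeta(2) = \pi^2/6$ is an admissible constant), and compares the tail mass with the integral-comparison bounds $(k+1)^{1-\eta}/(C(\eta-1))$ and $C k^{1-\eta}/(\eta-1)$. The function names are hypothetical.

```python
import math

ETA = 2.0
Z = math.pi ** 2 / 6.0   # zeta(2), normalising constant of i^{-2}
C = Z                    # 1/(C i^eta) <= theta0(i) <= C / i^eta holds with C = zeta(2) > 1

def theta0(i):
    """Illustrative p.m.f. on the positive integers with polynomial(2) decay."""
    return i ** (-ETA) / Z

def tail_mass(k):
    """Mass of {k+1, k+2, ...}, playing the role of theta0(0) at truncation level k."""
    return 1.0 - sum(theta0(i) for i in range(1, k + 1))

def tail_bounds(k):
    """Lower and upper bounds obtained by comparing the tail sum with integrals."""
    lower = (k + 1) ** (1.0 - ETA) / (C * (ETA - 1.0))
    upper = C * k ** (1.0 - ETA) / (ETA - 1.0)
    return lower, upper

for k in (5, 50, 500):
    lo, hi = tail_bounds(k)
    print(k, lo, tail_mass(k), hi)
```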



Let us first assume that $\theta_0$ has exponential($\eta$) decay. Then with $c = \frac{1}{C} \wedge \frac{\exp(-\eta)}{C(1-\exp(-\eta))}$,

$$\inf_{i\le k_n}\theta_0(i) \ge c\exp(-\eta k_n),$$

$$-\sum_{i=0}^{k_n}\log\theta_0(i) \le \eta\,\frac{k_n(k_n+3)}{2} + (k_n+1)\log C - \log\frac{\exp(-\eta)}{1-\exp(-\eta)}.$$

Invoking Proposition 3.14, the non-parametric Bernstein-Von Mises Theorem
holds for $\theta_0$ with exponential($\eta$) decay using the Dirichlet prior with parameter
$\beta > 0$ with truncation levels

$$k_n = \frac{1}{\eta}(\log n - a\log\log n),\quad a > 6.$$

Theorems 3.12 and 3.13 apply as soon as $\alpha > 1/2$, so that the Bayesian estimates
of entropy and of Rényi-entropy of order $\alpha > 1/2$ satisfy a Bernstein-von-Mises
theorem with $\sqrt{n}$-rate.

When $\theta_0$ has polynomial($\eta$) decay, $\inf_{i\le k_n}\theta_0(i) \ge 1/(C k_n^{\eta})$, and

$$-\sum_{i=0}^{k_n}\log\theta_0(i) \le \eta\, k_n\log k_n + (k_n+1)\log C.$$

Invoking Proposition 3.14, the non-parametric Bernstein-Von Mises Theorem
holds using the Dirichlet prior with parameter $\beta > 0$ with truncation levels

$$k_n = \frac{1}{u_n}\left(\frac{n}{(\log n)^{3}}\right)^{1/(3+\eta)},\quad u_n \to +\infty.$$

Theorems 3.12 and 3.13 concerning estimations of functionals hold as soon
as $2\alpha > 1 + 1/\eta$, so that the Bayesian estimates of entropy and of Rényi-entropy
of order $\alpha > 1/2 + 1/(2\eta)$ satisfy a Bernstein-von-Mises theorem with rate $\sqrt{n}$.
3.5. Comparison with Ghosal's conditions

Now, we aim at comparing the conditions used in this paper with the set of conditions
used by Ghosal (2000) to establish a Bernstein-Von Mises Theorem for sequences of
multinomial models using log-odds parametrization. An exhaustive comparison of the two approaches
(that is, comparing the merits of combining Le Cam's proof and concentration
inequalities for some quadratic forms with the merits of Ghosal's proof
which refines Portnoy's arguments) should first be based on a general purpose
result characterizing the impact of re-parametrization on asymptotic normality
of posterior distributions. This would exceed the ambitions of this paper.
Then a thorough comparison between conditions (P) (Prior Smoothness and
Concentration) and (R) (Prior concentration and behavior of likelihood ratios
in the vicinity of the target $\theta_0$) and the conditions used in this paper would be
in order. As a matter of fact, provided re-parametrization is taken into account,
the prior smoothness conditions in the two papers are not essentially different.
On the other hand the conditions on the integrability of likelihood ratios seem
somewhat different. Looking at general exponential families, Ghosal (2000) imposes
upper-bounds on the fourth and the third moment of linear forms of
$\sqrt{I(\theta)}\,\Delta_1(\theta)$ for $\theta$ close to $\theta_0$ (this is the meaning of conditions on the growth
of $B_{1,n}(c)$ and $B_{2,n}(c)$). In this paper, we take advantage of the fact that $\Delta_n(\theta)$
is a multinomial vector.

Keep in mind that we refrain from assuming that all $\theta_0(i)$, $i \le k_n$, are of
order $1/k_n$ as in (Ghosal, 2000, page 60). Indeed, we consider situations where
$k_n\inf_{i\le k_n}\theta_0(i) = o(1)$ as in Section 3.4. The trace of the information matrix
$I(\theta_0)$ (which coincides with $F^{-1}$ using Ghosal's notations) is equal to
$\sum_{i=1}^{k_n} 1/\theta_0(i) + k_n/\theta_0(0) \le 2k_n/\inf_{i\le k_n}\theta_0(i)$, and it may not be $O(k_n^2)$, as in
(Ghosal, 2000, page 60). For example, using the notations from Section 3.4, if
$\theta_0$ has polynomial($\eta$) decay,

$$\sum_{i=1}^{k_n}\frac{1}{\theta_0(i)} + \frac{k_n}{\theta_0(0)} \ge \frac{1}{C}\sum_{i=1}^{k_n} i^{\eta} + \frac{(\eta-1)k_n^{\eta}}{C} \ge \frac{k_n^{\eta+1}}{C(\eta+1)}.$$

In this setting, we may even look at the growth of $B_{2,n}(c)$ (as defined
in (Ghosal, 2000)) as $n$ tends to infinity



B2,„(c) = sup<^F 



a^/^(0)A„(0) 



l-Var.„(log|^)<^ 



Choosing a as ^|=1 and carefully performing straightforward computations, it is 
possible to check that if 9o has polynomial- (77) decay (according to the framework 
of Section 3.4), i?2,n(0) > Ckf^^, so that the clause B2,n(clogfc„)fc^(logfc„)/n ^■ 
for all c > in Condition (R), implies fc^^^** logA:„/n -^ 0. This condition is 
more demanding than the conditions we obtained at the end of Section 3.4.



3.6. Classical non-parametric approach to posterior concentration 

We compare the posterior concentration lemma (Lemma 3.6) with the classical
results on posterior concentration obtained in non-parametric statistics (See
Ghosal and van der Vaart (2007a), Ghosal et al. (2000), Ghosal and van der Vaart
(2007b), Ghosal and van der Vaart (2001)).

Let $\Theta_{k_n}$ denote the set of probability distributions over $\{0, \ldots, k_n\}$. Let $\epsilon_n$
satisfy $n\epsilon_n^2 = M_n$.

Let $\mathcal{V}_n(\epsilon_n)$ be the set:

$$\mathcal{V}_n(\epsilon_n) = \left\{\theta : \sum_{i=0}^{k_n}\theta_0(i)\log\frac{\theta_0(i)}{\theta(i)} \le \epsilon_n^2 \ \text{ and } \ \sum_{i=0}^{k_n}\theta_0(i)\Big(\log\frac{\theta_0(i)}{\theta(i)}\Big)^2 \le \epsilon_n^2\right\}.$$



Let $d$ denote the Hellinger distance between probability mass functions:

$$d(\theta_1, \theta_2) = \left(\sum_{i=0}^{k_n}\Big(\sqrt{\theta_1(i)} - \sqrt{\theta_2(i)}\Big)^2\right)^{1/2}.$$



Let $D(\epsilon, \Theta_{k_n}, d)$ denote the $\epsilon$-packing number of $\Theta_{k_n}$, that is the maximum
number of points in $\Theta_{k_n}$ such that the Hellinger distance between every pair is
at least $\epsilon$.

Theorem 2.1 in Ghosal et al. (2000) asserts that, if for some $C > 0$, we have
$W_n\{\mathcal{V}_n(\epsilon_n)\} \ge \exp(-C n\epsilon_n^2)$ and if $\log D(\epsilon_n, \Theta_{k_n}, d) \le n\epsilon_n^2$, then for large
enough $A$,

$$P_{\cdot\mid X_{1:n}}\{d(\theta, \theta_0) \ge A\epsilon_n\} \to 0$$

in $P_{n,0}$ probability.

In this paper, the prior $W_n$ is supported by $\Theta_{k_n}$, and a careful reading shows
that the proof in Ghosal et al. (2000) can be adapted to situations where the
sampling probability changes with $n$.

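Now, $\Theta_{k_n}$ endowed with the Hellinger distance is isometric to the intersection
of the positive quadrant and the unit ball of $\mathbb{R}^{k_n+1}$ endowed with the Euclidean
metric, so that there exists a universal constant $C$ such that

$$\left(\frac{1}{C\epsilon}\right)^{k_n} \le D(\epsilon, \Theta_{k_n}, d) \le \left(\frac{C}{\epsilon}\right)^{k_n}$$

and $\log D(\epsilon_n, \Theta_{k_n}, d) \le n\epsilon_n^2$ if and only if $k_n\log\frac{n}{M_n} \le C M_n$.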

h(i) 



lih^ I{9o)h < Mn, then sup,<fc^^ 



log 1 






< 



ninfi<fc„ Soli) 



Hence, for i < k„ 



hit) 



^iBoii)) V^Mi) 2n(?o(i)2 
So that, letting $\theta(i) = \theta_0(i) + h(i)/\sqrt{n}$,



Hi) 



nOoiif 



i=0 



and 



Y^e^ii) [log 



i=0 



0(^) 



2n 



aljh) ^ Jajih) 



S. Boucheron and E. Gassiat/A Bernstein- Von Mises Theorem 132 



Hence, as soon as $M_n/(n\inf_{i\le k_n}\theta_0(i)) = o(1)$, if for some $\delta > 0$ and some $c > 0$

$$W_n\{h : h^T I(\theta_0)h \le \delta M_n\} \ge e^{-cM_n}$$

then for some constant $C > 0$,

$$W_n\{\mathcal{V}_n(\epsilon_n)\} \ge e^{-CM_n}.$$

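Under our assumptions, the non-parametric Bayesian theorem implies that for
large enough $A$, $P_{\cdot\mid X_{1:n}}\{d(\theta, \theta_0) \ge A\epsilon_n\} \to 0$ in $P_{n,0}$ probability.

However, Lemma 3.6 establishes posterior concentration with respect to the Fisher
distance:

$$P_{\cdot\mid X_{1:n}}\{(\theta-\theta_0)^T I(\theta_0)(\theta-\theta_0) \ge \epsilon_n^2\} \to 0$$

in $P_{n,0}$ probability.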

As the Fisher distance upper-bounds the squared Hellinger distance (See
Tsybakov, 2004), Lemma 3.6 implies the generic posterior concentration lemma.
But Lemma 3.6 could not be deduced from the generic posterior concentration
lemma since the Fisher distance cannot be upper-bounded by a linear function
of the Hellinger distance. As a matter of fact,

{9 - e„f I {e„) {e - 9o) < d{e,e„)(i + 



info<.t<fc„ 9o{i) 



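Hence, if $d(\theta, \theta_0) = o(\inf_{0\le i\le k_n}\theta_0(i))$, Hellinger and Fisher metrics are comparable,
but this does not hold in full generality. For instance, for $\theta$ such that
$\theta(k_n) = \theta_0(k_n) + \sqrt{\theta_0(k_n)}$, $\theta(1) = \theta_0(1) - \sqrt{\theta_0(k_n)}$, and $\theta(i) = \theta_0(i)$ for $i \ne 1$ and
$i \ne k_n$, then $d(\theta, \theta_0) \to 0$, but $(\theta-\theta_0)^T I(\theta_0)(\theta-\theta_0) \sim 1$.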
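
The gap between the two metrics can be illustrated numerically. The sketch below is built on assumptions made for the example only: it takes $\theta_0(i) \propto (i+1)^{-2}$ on $\{0, \ldots, k\}$, moves a mass $\sqrt{\theta_0(k)}$ from cell 1 to cell $k$, and reports the Hellinger distance together with the Fisher quadratic form $\sum_i (\theta(i)-\theta_0(i))^2/\theta_0(i)$ for growing $k$. The function names are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance (without the 1/2 normalisation) between two p.m.f.s."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def fisher_distance(p, p0):
    """Quadratic form (p - p0)^T I(p0) (p - p0) = sum_i (p_i - p0_i)^2 / p0_i."""
    return sum((a - b) ** 2 / b for a, b in zip(p, p0))

def perturbed_pair(k):
    """Return (p0, p): p moves a mass sqrt(p0[k]) from cell 1 to cell k."""
    weights = [(i + 1) ** -2.0 for i in range(k + 1)]
    total = sum(weights)
    p0 = [w / total for w in weights]
    shift = math.sqrt(p0[k])
    p = list(p0)
    p[k] += shift
    p[1] -= shift
    return p0, p

for k in (10, 100, 1000):
    p0, p = perturbed_pair(k)
    print(k, hellinger(p, p0), fisher_distance(p, p0))
```

In this example the Hellinger distance shrinks as $k$ grows while the Fisher quadratic form stays above 1, which is the phenomenon described in the text.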



4. Proof of the Bernstein-Von Mises Theorem

In this section, we establish the building blocks of the proof of the Bernstein-Von
Mises Theorem, that is Proposition 3.10, the posterior concentration Lemma and
Proposition 3.11.

4.1. Truncated distributions

In order to prove Proposition 3.10, it is enough to upper bound NV$(M_n)$:

where $A_n$ and $C_n$ are defined in Section 3.1.

We take advantage of the fact that integration is performed on $\mathcal{E}_{\theta_0,k_n}(M_n)$,
in order to uniformly upper-bound the integrand.
Using the duality between $\ell_1$ and $\ell_\infty$, for all $h \in \mathcal{E}_{\theta_0,k_n}(M_n)$



An{h) < sup y^ h{i] 

heel ||h||i<M„ j^Q 



\nOo[i) J i=o,...,fc„ 



N, 



ndoii) 



-1 



and 



\Cn{h)\ < Mn sup 

i=0,...,kn 



iV,: 



n6lo(i) 



i? 



/m; 



^^ninfj<fc„ 6*0(0^ 



Then as $(1 - (1-x)e^{-y})_+ \le x + y$ for $x, y \ge 0$,



NV(A'f„) < Mn X sup 



N, 



nOo{i) 



2 sup 

j<fe„ 



N,, 



nOo{i) 



R 



/li^T 



■^/n inf. 



i<k„ Oq 



«, 



inf 



1- 






The second term can be upper-bounded assuming the prior smoothness condition.
The first term is a sum of two random suprema.

The expected value of the maximum of random variables with uniformly controlled
logarithmic moment generating functions can be handily upper-bounded
thanks to an argument due to Pisier (Massart, 2003): if $(W_i)_{1\le i\le k}$ are real
random variables, then (writing $P[\cdot]$ for expectation)

$$P\Big[\sup_{i\le k} W_i\Big] \le \inf_{\lambda>0}\frac{1}{\lambda}\Big(\log k + \sup_{i\le k}\log P\big[\exp(\lambda W_i)\big]\Big). \qquad (4.1)$$
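
To see how a maximal inequality of this type behaves, the following sketch considers the special case of $k$ independent standard Gaussian variables, an assumption made only for illustration. Their logarithmic moment generating function is $\lambda^2/2$, so the right-hand side becomes $\inf_{\lambda>0}(\log k/\lambda + \lambda/2) = \sqrt{2\log k}$, which is compared with a Monte Carlo estimate of the expected maximum.

```python
import math
import random

def pisier_bound_gaussian(k):
    """inf over lambda > 0 of (log k + lambda^2 / 2) / lambda, attained at lambda = sqrt(2 log k)."""
    return math.sqrt(2.0 * math.log(k))

def mean_of_max(k, reps, rng):
    """Monte Carlo estimate of E[max of k independent N(0, 1) variables]."""
    total = 0.0
    for _ in range(reps):
        total += max(rng.gauss(0.0, 1.0) for _ in range(k))
    return total / reps

rng = random.Random(2)
k = 50
estimate = mean_of_max(k, 2000, rng)
bound = pisier_bound_gaussian(k)
print(estimate, bound)
```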



For each i, the random variable Ni is binomially distributed with parameters n 
and 0o{i), log(l +u) < u, and for all u> 0, ^ ^^ 
leads to 



1 > 



I'So 



sup 

i<k„ 



N,, 



n9o{i) 



< inf 

A>0 



log2(fc„ 



1) exp 
-^ - 1 + sup 



+ 1, so that (4.1) 

\neo{i)J 



n9o(i) 



SO that choosing A = ■\/log(2(fc„ + l))ninfi<fe^ ^o(*)i ^^ the function ' 



1 is increasing on R-|_, letting (5„ 

iV,, 



_ / iog(2(fc„+i)r 

ninfi<fc eo{i) ' 



Pe 

Thus, 



i<k„ nOoW 



1 <P, 



sup 

i<k„ 



N,, 



F„,oNv(M„) < A/„ 



n9o{i] 
cxp{S„) - 1 

Sn 



< Sn + 



cxp((5„) 



1 + R 



/Mn 



^ninf,<fc„ 6*0(1) 



inf 



_ he£eo.k„{M„)Wn[Oa + ^) 

SUp^g£^^^^(Af^) Wn{Oo + ■^) 

and the proposition follows using Assumptions (3.1) and (3.4) and the fact that
$R(u) = O(u)$ as $u$ tends toward 0. □
4.2. Tail bounds for quadratic forms

In this section, we gather a few results concerning tail bounds for quadratic
forms or square-roots of quadratic forms in Gaussian and empirical settings.
All those bounds are obtained by resorting to concentration inequalities for
Gaussian distributions or for suprema of empirical processes.

Let us start with a first bound concerning chi-square distributions. Let $\xi_n^2$
be distributed according to $\chi^2_{k_n}$ (chi-square distribution with $k_n$ degrees of freedom).
The following inequality is a direct consequence of Cirelson's inequality
(Massart, 2003):

$$P\big\{\xi_n \ge \sqrt{k_n} + \sqrt{2x}\big\} \le \exp(-x). \qquad (4.2)$$
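
A quick Monte Carlo check of a bound of this form can be written as follows. The sketch assumes that the inequality is read for the Euclidean norm of a $k$-dimensional standard Gaussian vector (the square root of a chi-square variable); the dimension, the levels $x$ and the function name are illustrative choices.

```python
import math
import random

def chi_tail_frequency(k, x, reps, rng):
    """Frequency of the event ||g|| >= sqrt(k) + sqrt(2x) for g a k-dimensional standard Gaussian vector."""
    threshold = math.sqrt(k) + math.sqrt(2.0 * x)
    hits = 0
    for _ in range(reps):
        norm = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)))
        if norm >= threshold:
            hits += 1
    return hits / reps

rng = random.Random(3)
k = 10
frequencies = {x: chi_tail_frequency(k, x, 5000, rng) for x in (0.5, 1.0, 2.0)}
for x, freq in frequencies.items():
    print(x, freq, math.exp(-x))
```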

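The following handy inequality provides non-asymptotic tail-bounds for Pearson
statistics. For any $\theta \in \Theta_{k_n}$ let $V_n(\theta)$ denote the square root of the Pearson
statistic

$$V_n(\theta) = \left(\sum_{i=0}^{k_n}\frac{(N_i - n\theta(i))^2}{n\theta(i)}\right)^{1/2}.$$

The following follows from Talagrand's inequality for suprema of empirical processes
(Massart, 2003, p. 170): for all $x > 0$,

$$P_{n,0}\left\{V_n(\theta_0) \ge 2\sqrt{k_n} + \sqrt{2x} + \frac{3x}{\sqrt{n\inf_{i\le k_n}\theta_0(i)}}\right\} \le \exp(-x). \qquad (4.3)$$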
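
The behaviour of the square root of the Pearson statistic under the sampling distribution can be explored by simulation. The sketch below assumes the threshold $2\sqrt{k} + \sqrt{2x} + 3x/\sqrt{n\min_i\theta_0(i)}$ for a p.m.f. on $\{0, \ldots, k\}$; the p.m.f., the sample size and all function names are illustrative assumptions.

```python
import math
import random
from collections import Counter

def pearson_root(counts, theta0, n):
    """Square root of the Pearson statistic sum_i (N_i - n theta0_i)^2 / (n theta0_i)."""
    return math.sqrt(sum((counts.get(i, 0) - n * t) ** 2 / (n * t)
                         for i, t in enumerate(theta0)))

def threshold(theta0, n, x):
    """2 sqrt(k) + sqrt(2x) + 3x / sqrt(n min theta0), with k = len(theta0) - 1."""
    k = len(theta0) - 1
    return 2.0 * math.sqrt(k) + math.sqrt(2.0 * x) + 3.0 * x / math.sqrt(n * min(theta0))

def exceed_frequency(theta0, n, x, reps, rng):
    """Frequency with which the statistic exceeds the threshold over simulated multinomial samples."""
    cells = range(len(theta0))
    s = threshold(theta0, n, x)
    hits = 0
    for _ in range(reps):
        counts = Counter(rng.choices(cells, weights=theta0, k=n))
        if pearson_root(counts, theta0, n) >= s:
            hits += 1
    return hits / reps

rng = random.Random(4)
theta0 = [0.1, 0.2, 0.3, 0.4]   # illustrative p.m.f. on {0, 1, 2, 3}
freq = exceed_frequency(theta0, n=500, x=1.0, reps=300, rng=rng)
print(freq, math.exp(-1.0))
```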

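Non-centered Pearson statistics also show up while proving the posterior
concentration lemma. Let $\theta = \theta_0 + h/\sqrt{n}$ with $\sigma_n(h) \ge \sqrt{M_n}$. Note that from the
definition of $V_n(\theta_0)$, it follows that

$$\begin{aligned}
V_n(\theta_0) &= \sup_{a : \|a\|_2 = 1}\ \sum_{i=0}^{k_n} a_i\,\frac{N_i - n\theta_0(i)}{\sqrt{n\theta_0(i)}} \\
&= \sup_{a : \|a\|_2 = 1}\ \sum_{i=0}^{k_n} a_i\left(\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} + \sqrt{n}\,\frac{\theta(i) - \theta_0(i)}{\sqrt{\theta_0(i)}}\right) \\
&\ge \sum_{i=0}^{k_n} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} + \sigma_n(h)
\end{aligned}$$

where $a_i^* = \frac{h(i)}{\sigma_n(h)\sqrt{\theta_0(i)}}$ for all $i \le k_n$. So that

$$P_{n,h}\big(V_n(\theta_0) \le s_n\big) \le P_{n,h}\left(\sum_{i=0}^{k_n} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} \le -\sigma_n(h) + s_n\right).$$

Computations carried out in the Appendix allow to establish that if $\sigma_n^2(h) \ge M_n$,
and if $M_n = o(n\inf_{i\le k_n}\theta_0(i))$,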

^+ , ^^^" =|<2cxp(-4^ 



I'nM < Vn{9o) < 2^kn + J^+ ,,.'' = ^ < 2exp (-i^) . (4.4) 



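4.3. Proof of the posterior concentration lemma

Proof. We need to check that $P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}$ is small in $P_{n,0}$
probability. For any $\theta \in \Theta_{k_n}$, let $V_n(\theta)$ denote the square root of the Pearson
statistic

$$V_n(\theta) = \left(\sum_{i=0}^{k_n}\frac{(N_i - n\theta(i))^2}{n\theta(i)}\right)^{1/2} = \big(\Delta_n^T(\theta)\,I(\theta)\,\Delta_n(\theta)\big)^{1/2}.$$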

A sequence of tests $(\phi_n)_{n\in\mathbb{N}}$ is defined by $\phi_n = 1_{V_n(\theta_0)\ge s_n}$ where each threshold
$s_n$ is defined by $s_n = 2\sqrt{k_n} + \sqrt{2x_n} + 3x_n/\sqrt{n\inf_{i\le k_n}\theta_0(i)}$ with $x_n = M_n/4$. The
tests $\phi_n$ aim at separating $\theta_0$ from the complements of Fisher balls centered at
$\theta_0$, that is from $\{\theta_0 + h/\sqrt{n} : \sigma_n^2(h) \ge M_n\}$.
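Hence, we need to check that

$$\begin{aligned}
P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}
&= P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}\,\phi_n + P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}\,(1-\phi_n) \\
&\le \phi_n + P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}\,(1-\phi_n)
\end{aligned}$$

is small in $P_{n,0}$ probability. As, in order to upper-bound $P_{n,0}\phi_n$, it is enough
to bound the tail of Pearson's statistics under $P_{n,0}$, we focus on the expected
value of the second term. Note that the latter is null as soon as the maximum
likelihood estimator errs too far away from $\theta_0$.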

In order to control $P_{n,0}\big[P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}(1-\phi_n)\big]$, we resort to
the same contiguity trick as in (van der Vaart, 1998). Let $A$ be a fixed positive
real, define the probability distribution $P_{n,A}$ on $\mathbb{N}^n$ as the mixture of $P_{n,h}$ when
the prior is conditioned on the ellipsoid $\theta_0 + \frac{1}{\sqrt{n}}\mathcal{E}_{\theta_0,k_n}(A)$:

$$P_{n,A}(B) = \frac{\int_{\mathcal{E}_{\theta_0,k_n}(A)} P_{n,h}(B)\, w_n(\theta_0 + h/\sqrt{n})\,\mathrm{d}h}{\int_{\mathcal{E}_{\theta_0,k_n}(A)} w_n(\theta_0 + h/\sqrt{n})\,\mathrm{d}h}.$$

Arguing as in (van der Vaart, 2002), thanks to Lemma 2.3, one can check that
the sequences $(P_{n,0})_n$ and $(P_{n,A})_n$ are mutually contiguous (for the sake of
completeness, a proof is given in the Appendix, see Section A). Hence, it is enough
to upper-bound

$$P_{n,A}\big[P_{H_n\mid X_{1:n}}\{H_n^T I(\theta_0)H_n \ge M_n\}(1-\phi_n)\big] \le \frac{\sup_{h : h^T I(\theta_0)h \ge M_n} P_{n,h}(1-\phi_n)}{W_n\{\theta_0 + \mathcal{E}_{\theta_0,k_n}(A)/\sqrt{n}\}}.$$

We will handle $P_{n,0}\phi_n$ and $P_{n,h}(1-\phi_n)$ using non-asymptotic upper bounds
for centered and non-centered Pearson statistics while the prior mass around $\theta_0$
($W_n\{\theta_0 + \mathcal{E}_{\theta_0,k_n}(A)/\sqrt{n}\}$) can be lower-bounded by assuming Conditions 3.4
and 3.5.

A direct application of Inequality (4.3) gives $P_{n,0}\phi_n = P_{n,0}\{V_n(\theta_0) \ge s_n\} \le
\exp(-x_n) = \exp(-M_n/4)$.

Non-centered Pearson statistics show up while handling $P_{n,h}(1-\phi_n)$. Indeed
$P_{n,h}(1-\phi_n) = P_{n,h}(V_n(\theta_0) < s_n)$. Let $\theta = \theta_0 + h/\sqrt{n}$ with $\sigma_n(h) \ge \sqrt{M_n}$.
Then, using the definition of $s_n$, Inequality (4.4) entails

P«j>(l-0«)<2expf— ^ 

Let us now lower bound $W_n\{\theta_0 + \mathcal{E}_{\theta_0,k_n}(A)/\sqrt{n}\}$. Performing a change of variables
(agreeing on the convention that $h(0) = -\sum_{i=1}^{k_n} h(i)$) leads to

Wr.^^{n{e~e^fi{e^){e-e^)<A} 






-1 



> ( ■:7-------''°"';°:f' )'.„..(.o)(i)-/ ftdM.). 

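But the volume of the ellipsoid in $\mathbb{R}^{k_n}$ induced by $\mathcal{E}_{\theta_0,k_n}(A)$ is the inverse of the
square root of the determinant of $I(\theta_0)$ (that is $\prod_{i=0}^{k_n}\theta_0(i)^{1/2}$) times the volume
of the sphere with radius $\sqrt{A}$ in $\mathbb{R}^{k_n}$, that is $A^{k_n/2}\,\frac{2\Gamma(1/2)^{k_n}}{k_n\Gamma(k_n/2)}$, so that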

WnA<d - 0ofl{Oo){O - 0o) < A} 



Thus, assuming conditions 3.4 and 3.5: 

SUPh^/(go)/i>A/„ ^n,h (1 - (t)n) 



Wn{eo+£9o,kM)IV^} 



i 


(Oo) 


11 

i=0 


eo(*) 


,1/2 M' 


)^ 


2r(i)^" 


Wn 


fcnr(^)- 


<c 


'exp 


/ 
V 


Mn 
)(l + o


(1)) 


n 
4.4. Posterior Gaussian concentration

Proving Proposition 3.11 amounts to checking that the growth rate of the sequence
of radii $M_n$ is large enough so as to balance the growth rate of dimension $k_n$.
By Lemma B.1:



■<f"-A4„ 



2 / dAfk„{An,eo,I~\eo)){h). 

/ct„(/i)>M„ 



a„{h)>M„ 



< 



The right-hand-side can be upper-bounded: 

dA4„(A„,eo,^"'(^o))(/i) 

l(/i+A„,(,o)^-f(eo)('i+A„,(,o)>M„dA/'fc„(0, ^^^(^o))(/i) 

lhr7(e„)ft>M„/4dA/fc„(0, /~^(6'o))(/l) + lAjg^7(eo)A„,eo>M„/4 
^\\h\\2>VM^/2^k„{0JdkJih) + lA^_,^/(eo)A„,eo>M„/4 • 

so that 



< 



P [Ca > V^} + F,,o (a^^, J(0o)A„,,„ > ^) 



where $\xi_n^2$ is distributed according to $\chi^2_{k_n}$ (chi-square distribution with $k_n$ degrees
of freedom). Then, invoking (4.2),



P{^n> 



< Cxp — -r— 



The second term in the upper bound is handled using (4.3) and choosing x 

"^^ \^ 128 ' 



512 



5. Proof of Theorem 3.12

In frequentist statistics, once asymptotic normality has been proved for an estimator,
the so-called delta-method allows one to extend this result to smooth functionals
of this estimator. In this section, we develop an ad hoc approach that
parallels the classical derivation of the delta-method. Taylor expansions allow us to
write $\sqrt{n}\,(G_{n,\alpha}(T) - G_{n,\alpha}(\hat\theta_n))$ as the sum of a linear function of $H_n - \Delta_n(\theta_0)$ and
of two (random) quadratic forms. Checking the theorem amounts to establishing
that under $P_{n,0}$ those two quadratic forms converge to 0 in distribution.

Recall that $H_n = \sqrt{n}\,(T(i) - \theta_0(i))_{i=1}^{k_n}$ and $\Delta_n(\theta_0) = \sqrt{n}\,(\hat\theta_n(i) - \theta_0(i))_{i=1}^{k_n}$. If $n$ is
non-ambiguous, let $\nabla G_\alpha(\theta) = (g'_\alpha(\theta(i)))_{i=1}^{k_n}$ and let $\nabla^2 G_\alpha(\theta) = \mathrm{diag}(g''_\alpha(\theta(i)))_{i=1}^{k_n}$.
Then for some (random) vectors $\tilde T$ and $\tilde\theta$ with $\tilde T(i)$ (resp. $\tilde\theta(i)$) between $T(i)$ and
$\theta_0(i)$ (resp. between $\hat\theta_n(i)$ and $\theta_0(i)$) for all $i = 1, \ldots, k_n$:

$$\begin{aligned}
\sqrt{n}\,\big(G_{n,\alpha}(T) - G_{n,\alpha}(\hat\theta_n)\big) ={}& (H_n - \Delta_n(\theta_0))^T\,\nabla G_\alpha(\theta_0) \\
&+ \frac{1}{2\sqrt{n}}\,H_n^T\,\nabla^2 G_\alpha(\tilde T)\,H_n \\
&- \frac{1}{2\sqrt{n}}\,\Delta_n^T(\theta_0)\,\nabla^2 G_\alpha(\tilde\theta)\,\Delta_n(\theta_0).
\end{aligned}$$



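This follows from

$$\begin{aligned}
\sqrt{n}\,\big(G_{n,\alpha}(T) - G_\alpha(\theta_0)\big)
&= -\sqrt{n}\sum_{i=k_n+1}^{+\infty} g_\alpha(\theta_0(i)) + \sqrt{n}\sum_{i=1}^{k_n}\big(g_\alpha(T(i)) - g_\alpha(\theta_0(i))\big) \\
&= -\sqrt{n}\sum_{i=k_n+1}^{+\infty} g_\alpha(\theta_0(i)) + H_n^T\,\nabla G_\alpha(\theta_0) + R_n
\end{aligned}$$

with

$$R_n = \frac{1}{2\sqrt{n}}\,H_n^T\,\nabla^2 G_\alpha(\tilde T)\,H_n.$$

Meanwhile

$$\begin{aligned}
\sqrt{n}\,\big(G_{n,\alpha}(\hat\theta_n) - G_\alpha(\theta_0)\big)
&= -\sqrt{n}\sum_{i=k_n+1}^{+\infty} g_\alpha(\theta_0(i)) + \sqrt{n}\sum_{i=1}^{k_n}\big(g_\alpha(\hat\theta_n(i)) - g_\alpha(\theta_0(i))\big) \\
&= -\sqrt{n}\sum_{i=k_n+1}^{+\infty} g_\alpha(\theta_0(i)) + \Delta_n^T(\theta_0)\,\nabla G_\alpha(\theta_0) + R'_n
\end{aligned}$$

with

$$R'_n = \frac{1}{2\sqrt{n}}\,\Delta_n^T(\theta_0)\,\nabla^2 G_\alpha(\tilde\theta)\,\Delta_n(\theta_0).$$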



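Now recall that $\gamma_{n,\alpha}^2 = \mathrm{Var}\big((H_n - \Delta_n(\theta_0))^T\,\nabla G_\alpha(\theta_0)\big)$. Let $(\epsilon_n)_n$ be a sequence
tending to 0 as $n$ tends to infinity.

For any interval $I$ of $\mathbb{R}$, let $I_{\epsilon_n}$ be the $\epsilon_n$-blowup of $I$: $I_{\epsilon_n} = \{x : \exists y \in I,
|x - y| \le \epsilon_n\}$. Then, for some positive constant $C$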



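$$\begin{aligned}
\sup_{I}\left|P_{H_n\mid X_{1:n}}\left\{\frac{\sqrt{n}\,(G_{n,\alpha}(T) - G_{n,\alpha}(\hat\theta_n))}{\gamma_{n,\alpha}} \in I\right\} - \Phi(I)\right|
\le{}& \sup_{I}\left|P_{H_n\mid X_{1:n}}\left\{\frac{(H_n - \Delta_n(\theta_0))^T\,\nabla G_\alpha(\theta_0)}{\gamma_{n,\alpha}} \in I_{\epsilon_n}\right\} - \Phi(I)\right| \\
&+ P_{H_n\mid X_{1:n}}\big\{|R_n - R'_n| \ge C\epsilon_n\big\}.
\end{aligned}$$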
The first summand on the right-hand-side is easily dealt with by applying the
Bernstein-Von Mises Theorem:



sup 



/ (g„-A„(0o)rVg^(0o) \ 

^H„\Xi,„ e -'e„ I - '5 U j 

\ ln,a J 



< sup 1$ (/,„) - $ (/)| + Pff„|x,„ - M{A„i9o)J-\9o)) 
lex 

< ''' 



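Now,

$$P_{H_n\mid X_{1:n}}\big\{|R_n - R'_n| \ge C\epsilon_n\big\} \le P_{H_n\mid X_{1:n}}\Big\{|R_n| \ge \frac{C\epsilon_n}{2}\Big\} + 1_{|R'_n| \ge \frac{C\epsilon_n}{2}}$$

as $R'_n$ is $X_{1:n}$-measurable. Theorem 3.12 follows if it is possible to choose a
sequence $(\epsilon_n)_n$ such that both terms in the upper bound tend to 0 in $P_{n,0}$
probability.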

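Let us focus for the moment on the first term. We aim at proving that the
following upper-bound holds for large enough $n$ and some positive constant $D$:

$$P_{H_n\mid X_{1:n}}\Big(|R_n| \ge \frac{C\epsilon_n}{2}\Big) \le 5\,P_{H_n\mid X_{1:n}}\Big(H_n^T I(\theta_0)H_n \ge D\epsilon_n\sqrt{n\inf_{j\le k_n}\theta_0(j)}\Big). \qquad (5.1)$$

As $g''_\alpha$ is monotone, the following inequalities hold:

$$|g''_\alpha(\tilde T(i))| \le \max\{|g''_\alpha(T(i))|,\ |g''_\alpha(\theta_0(i))|\} \le |g''_\alpha(T(i))| + |g''_\alpha(\theta_0(i))|.$$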

This entails 

V^Rn < H^V^G^{0o)Hn + HlV^G^ (f) iJ„ . 

Ph„|x,„ {Rn > Ce„) < Ph„\x,..^ (Hl\/^G^{eo)H^ > 9ll^\ 

+ Ph„\x,..^ (hI^'G^ (f ) H„ > ^^ 

Henceforth, let Ca = a(a~i) ^'^^ ol^\ and G\ = \. Note that for all positive 

C^gl{x) = x"-2 and G^HlV^G^{e^)H^ = YHz, f^^o(0"~^ • 
For a > 1, 

Meanwhile, for i < a < 1, the obvious fact supj((?o(J))"~^ < (ii^fj<fc„ ^o(j))~^^^ : 
implies 
So, we get



\i=l 



tT T(a \ U ^ ^a^-^n 



< Ph^\x,..^ \ K HOo)Hn > -^^Jnmi^ 0o(j) 



On the other hand, 

fcn 






z=l 

1 l-^TiU Jl * 



~t Mi) I y/nMj<k„eo{j) 



k n I l\^ g„(') 



^ra\i/2^ "-2 



Hence, 



fi Mi) I ^nmfj<fc„6loO') 



^..„|x.„fi:(i?"W)'rW"-^>^"^^" 






2V^ 



< Phjx,..„ ( E ^^o(*)"'' ^ 2"-3c„Ce„V^ 
We may now sum up those inequalities:

Ph„\X,..^ (\Rn\ > ^) < J2PH.\X^-\H^Iieo)H„ > A. 

with 

^3,n = 2 Ai_„ 

/( oa — 2 /( 

^4,ri — ^ ^2,n 

A5,„ = n inf 6*00) . 

3<k-n 
This is enough to prove Inequality (5.1). Up to $\|P_{H_n\mid X_{1:n}} - \mathcal{N}_{k_n}\|$, the posterior
probability of the event $H_n^T I(\theta_0)H_n \ge u_n$ equals $\mathcal{N}_{k_n}\{H_n^T I(\theta_0)H_n \ge u_n\}$, that
is the probability that a non-central $\chi^2_{k_n}$-distributed random variable with
non-centrality parameter $\Delta_n^T(\theta_0)I(\theta_0)\Delta_n(\theta_0) = V_n^2(\theta_0)$ exceeds $u_n$. The latter
probability is upper-bounded by

where $\xi_n^2$ follows a $\chi^2_{k_n}$ distribution. The probability that $\xi_n^2$ is larger than $u_n/4$
may be upper bounded using Cirelson's inequality. As soon as $(\epsilon_n)_n$ satisfies

$$\epsilon_n\,\frac{\sqrt{n\inf_{j\le k_n}\theta_0(j)}}{k_n} \to +\infty,$$

this probability tends to 0.

The $P_{n,0}$-probability that the non-centrality parameter $V_n^2(\theta_0)$ is large may
be upper bounded using Inequality (4.3), and invoking the Bernstein-Von Mises
Theorem to handle $\|P_{H_n\mid X_{1:n}} - \mathcal{N}_{k_n}(\Delta_n(\theta_0), I^{-1}(\theta_0))\|$,

$$P_{H_n\mid X_{1:n}}\Big(|R_n| \ge \frac{C\epsilon_n}{2}\Big) \to 0$$

in $P_{n,0}$-probability.

Using the same approach as before, one establishes that for large enough $n$
and some positive constant $D$:

$$P_{n,0}\Big(|R'_n| \ge \frac{C\epsilon_n}{2}\Big) \le 5\,P_{n,0}\Big(V_n^2(\theta_0) \ge D\epsilon_n\sqrt{n\inf_{j\le k_n}\theta_0(j)}\Big).$$

The right-hand-side may be upper-bounded using again (4.3). It tends to 0 as
soon as $\epsilon_n\sqrt{n\inf_{j\le k_n}\theta_0(j)}/k_n \to +\infty$.



6. Proof of Theorem 3.13

We have already proved that $R'_n = o_{P_{n,0}}(1)$, so that under the assumptions of
Theorem 3.13, equation (5.1) translates into

$$\sqrt{n}\,\big(G_{n,\alpha}(\hat\theta_n) - G_\alpha(\theta_0)\big) = o(1) + \Delta_n^T(\theta_0)\,\nabla G_\alpha(\theta_0) + o_{P_{n,0}}(1)$$

and the result follows from the Berry-Esseen Theorem.

Appendix A: Contiguity 

We first prove Lemma 2.3. 
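Proof. Let us first notice that if $\sigma_n^2(h_n)$ tends toward $\sigma^2 > 0$, there exists some
$M > 0$ such that $h_n \in \mathcal{E}(M)$. The contiguity proof follows from a straightforward
analysis of the log-likelihood ratio and an invocation of Le Cam's first
Lemma (van der Vaart, 2002).

A Taylor expansion of the logarithm leads to

$$\log\frac{P_{n,h_n}}{P_{n,0}}(x) = \frac{1}{\sqrt{n}}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)}{\theta_0(i)} - \frac{1}{2n}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)^2}{\theta_0(i)^2} + \sum_{i=0}^{k_n} N_i\Big[\log(1+u_i) - u_i + \frac{u_i^2}{2}\Big], \quad u_i = \frac{h_n(i)}{\sqrt{n}\,\theta_0(i)}.$$

The proof consists in checking the three following points:

1. the remainder term (the last sum) converges in probability toward 0.

2. the first summand converges in distribution toward $\mathcal{N}(0, \sigma^2)$.

3. the middle term converges in probability toward $-\sigma^2/2$.

Let us check the first point. As a matter of fact:

But

$$\sup_{i\le k_n}\frac{|h_n(i)|}{\sqrt{n}\,\theta_0(i)} \le \frac{\sigma_n(h_n)}{\sqrt{n\inf_{i\le k_n}\theta_0(i)}} = o(1).$$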

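In order to check the second point, note that the random variable
$Z_n(h_n) = \frac{1}{\sqrt{n}}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)}{\theta_0(i)}$ can be rewritten as a sum of i.i.d. random variables:

$$Z_n(h_n) = \frac{1}{\sqrt{n}}\sum_{j=1}^{n} Y_j$$

with

$$Y_j = \sum_{i=0}^{k_n} 1_{X_j = i}\,\frac{h_n(i)}{\theta_0(i)}.$$

Under $P_{n,0}$, each random variable $Y_j$ is equal to $h_n(i)/\theta_0(i)$ with probability $\theta_0(i)$, so
that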
that 



p|>^.f = E 



'" '/i„(*)|3 



.=0 'oi^) 



2 



i=0 



Y^ iM«ir ^ / _M«) \ ^ 2/, N 
that is



as $\sigma_n^2(h_n)$ is bounded and bounded away from 0. The Berry-Esseen Theorem
(Dudley, 2002) entails that as $n$ tends to infinity, $Z_n(h_n)$ converges in distribution
toward $\mathcal{N}(0, \sigma^2)$.
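
The first two moments of the summands can be checked on a small example. The sketch below assumes that each summand takes the value $h(i)/\theta_0(i)$ with probability $\theta_0(i)$ and that $h$ sums to zero; it verifies that the mean vanishes and that the variance equals $\sigma^2(h) = \sum_i h(i)^2/\theta_0(i)$. The numerical values of `theta0` and `h` are illustrative.

```python
def moments(theta0, h):
    """Mean and variance of Y = h(X) / theta0(X) when X is drawn from theta0."""
    mean = sum(t * (hi / t) for t, hi in zip(theta0, h))
    second = sum(t * (hi / t) ** 2 for t, hi in zip(theta0, h))
    return mean, second - mean ** 2

theta0 = [0.4, 0.3, 0.2, 0.1]    # illustrative p.m.f. on {0, 1, 2, 3}
h = [0.5, -0.2, -0.4, 0.1]       # illustrative direction with sum(h) = 0
mean, variance = moments(theta0, h)
sigma_sq = sum(hi * hi / t for hi, t in zip(h, theta0))
print(mean, variance, sigma_sq)
```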

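Finally, the middle term $\frac{1}{2n}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)^2}{\theta_0(i)^2}$ converges in probability toward
$\frac{1}{2}\sigma_n(h_n)^2$. Indeed, let

$$U_n(h_n) = \frac{1}{n}\sum_{i=0}^{k_n} N_i\,\frac{h_n(i)^2}{\theta_0(i)^2} = \frac{1}{n}\sum_{j=1}^{n}\xi_j(h_n)$$

with

$$\xi_j(h_n) = \sum_{i=0}^{k_n} 1_{X_j = i}\,\frac{h_n(i)^2}{\theta_0(i)^2}.$$

Then

$$\mathrm{Var}(U_n(h_n)) = \frac{1}{n}\mathrm{Var}(\xi_1(h_n)) \le \frac{1}{n}\sum_{i=0}^{k_n}\frac{h_n(i)^4}{\theta_0(i)^3} \le \frac{M^2}{n\inf_{i\le k_n}\theta_0(i)}.$$

Hence, the sequence of distributions of likelihood ratios $\frac{P_{n,h_n}}{P_{n,0}}(X_{1:n})$ converges
weakly toward a log-normal distribution with parameters $-\sigma^2/2$ and $\sigma^2$. The
Lemma follows directly from Le Cam's first Lemma (van der Vaart, 1998). □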

Lemma A.1. Let $\theta_0$ denote a probability mass function over $\mathbb{N}^*$. If the sequence
of truncation levels $(k_n)_{n\in\mathbb{N}}$ satisfies Condition 2.1, the sequences $(P_{n,0})_n$ and
$(P_{n,A})_n$ are mutually contiguous.

Proof. Let $(B_n)$ be a sequence of events where for each $n$, $B_n \subseteq \{0, \ldots, k_n\}^n$.
Then

$$P_{n,A}(B_n) \le \sup_{h : \sigma_n(h)^2 \le A} P_{n,h}(B_n)$$

so that for some sequence $(h_n)_n$ such that for all $n$, $\sigma_n^2(h_n) \le A$,

$$\limsup_{n\to+\infty} P_{n,A}(B_n) \le \limsup_{n\to+\infty} P_{n,h_n}(B_n).$$

But as $(\theta_0(i))$ decreases to 0 at infinity, $\mathcal{E}_{\theta_0,k_n}(A)$ is a finite dimensional closed
subset of a compact set in $\ell_2(\mathbb{N}^*)$, so that one may extract a subsequence $(h_{n_p})_p$
such that $\sigma_{n_p}^2(h_{n_p}) \to \sigma^2$ for some $\sigma^2$ as $p \to +\infty$, and such that

$$\limsup_{n\to+\infty} P_{n,A}(B_n) \le \lim_{p\to+\infty} P_{n_p,h_{n_p}}(B_{n_p}).$$

Applying Lemma 2.3 gives that $P_{n,0}(B_n) \to 0$ implies $P_{n,A}(B_n) \to 0$.
The reverse implication may be proved with the same reasoning using that

$$\inf_{h : \sigma_n(h)^2 \le A} P_{n,h}(B_n) \le P_{n,A}(B_n).$$

□

Appendix B: Distance in variation and conditioning 

The obvious proof of the following folklore lemma is omitted.

Lemma B.1. Let $P$ denote a probability distribution on some space $(\Omega, \mathcal{F})$.
Let $A$ denote an event with non-null $P$-probability and let $P^A$ denote the conditional
probability given $A$, that is $P^A(B) = P(A \cap B)/P(A)$, then

$$\|P^A - P\| = P(A^c).$$
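
The identity can be checked exactly on a finite space. The sketch below assumes that the norm is the variation distance $\sup_B |P^A(B) - P(B)|$, which equals half the $\ell_1$ distance for distributions on a finite set; the distribution `p` and the event `A` are illustrative.

```python
from fractions import Fraction

def total_variation(p, q):
    """sup_B |p(B) - q(B)|, computed as half the l1 distance on a finite set."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

p = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
A = {0, 2, 3}                                   # conditioning event
mass_A = sum(p[i] for i in A)
p_cond = [p[i] / mass_A if i in A else Fraction(0) for i in range(len(p))]
distance = total_variation(p_cond, p)
mass_complement = 1 - mass_A
print(distance, mass_complement)
```

Exact rational arithmetic is used so that the two quantities can be compared without rounding.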

Appendix C: Proof of inequality (3.9) 

Proof. 







Using the convexity of $x \mapsto (1-x)_+$ and Jensen's inequality, the right-hand-side
can be upper-bounded:



ju-M„ _ pA/„ 




dAr,^"W^n(^o^^^ '^^"-^^""^ 






+ 
Now, the quadratic Taylor expansion of the log-likelihood ratio translates into 




{i(A„(3)-A„(/i)) + (C„(g)-C„(/!.))} 



P„,s(Xl:„)dA/f"(/l) 



P„.„(Xi:„)dAA,*^^"(5) 

Plugging this expansion into the upper-bound on NV$(M_n)$ leads to (3.9). □




Appendix D: Tail bounds for non-centered Pearson statistics 

This section provides a proof of Inequality (4.4). Recall from Section 4.2, that

$$P_{n,h}\big(V_n(\theta_0) \le s_n\big) \le P_{n,h}\left(\sum_{i=0}^{k_n} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} \le -\sigma_n(h) + s_n\right)$$

where $a_i^* = \frac{h(i)}{\sigma_n(h)\sqrt{\theta_0(i)}}$ for all $i \le k_n$. Although $\sum_{i=0}^{k_n} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}$ is just a sum
of i.i.d. random variables, we found no obvious way to use classical exponential
inequalities (either Hoeffding or Bernstein inequalities) to prove the tail bounds
we need. Before resorting to classical inequalities, we split the sum into two
pieces according to the signs of the coefficients $a_i^*$. The two pieces are handled
using negative association arguments.

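Let $J = \{i : i \le k_n,\ a_i^* \ge 0\}$ and $J^c = \{i : i \le k_n,\ a_i^* < 0\}$. Note first that

$$\begin{aligned}
P_{n,h}\left(\sum_{i=0}^{k_n} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} \le -\sigma_n(h) + s_n\right)
\le{}& P_{n,h}\left(\sum_{i\in J} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} \le -\frac{1}{2}(\sigma_n(h) - s_n)\right) \\
&+ P_{n,h}\left(\sum_{i\in J^c} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}} \le -\frac{1}{2}(\sigma_n(h) - s_n)\right) \\
\le{}& \inf_{\lambda<0}\exp\left(\log P_{n,h}\left[\exp\left(\sum_{i\in J}\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right)\right] + \frac{\lambda}{2}(\sigma_n(h) - s_n)\right) \\
&+ \inf_{\lambda>0}\exp\left(\log P_{n,h}\left[\exp\left(-\sum_{i\in J^c}\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right)\right] - \frac{\lambda}{2}(\sigma_n(h) - s_n)\right).
\end{aligned}$$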



Following Dubhashi and Ranjan (1998), a collection of random variables $Z_1, \ldots, Z_n$
is said to be negatively associated if for any $I \subseteq \{1, \ldots, n\}$, for any functions
$f : \mathbb{R}^{|I|} \to \mathbb{R}$ and $g : \mathbb{R}^{|I^c|} \to \mathbb{R}$ that are either both non-decreasing or both
non-increasing,

$$P\big[f(Z_i : i \in I)\,g(Z_i : i \in I^c)\big] \le P\big[f(Z_i : i \in I)\big]\,P\big[g(Z_i : i \in I^c)\big].$$

By Theorem 14 from Dubhashi and Ranjan (1998), both sets of random variables
$\big(a_i^*(N_i - n\theta(i))/\sqrt{n\theta_0(i)}\big)_{i\in J}$ and $\big(a_i^*(N_i - n\theta(i))/\sqrt{n\theta_0(i)}\big)_{i\in J^c}$ are
negatively associated in the sense of Dubhashi and Ranjan (1998).
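
A simple consequence of negative association, obtained by taking $f$ and $g$ to be coordinate functions, is that distinct multinomial counts are negatively correlated. The sketch below checks this by exact enumeration on a small example and compares the result with the classical value $-n\theta_i\theta_j$; the p.m.f. `theta`, the sample size and the function name are illustrative assumptions.

```python
from itertools import product

def multinomial_covariance(theta, n, i, j):
    """Exact Cov(N_i, N_j) for n i.i.d. draws from theta, by enumerating all samples."""
    cells = range(len(theta))
    e_i = e_j = e_ij = 0.0
    for sample in product(cells, repeat=n):
        prob = 1.0
        for c in sample:
            prob *= theta[c]
        n_i, n_j = sample.count(i), sample.count(j)
        e_i += prob * n_i
        e_j += prob * n_j
        e_ij += prob * n_i * n_j
    return e_ij - e_i * e_j

theta = [0.5, 0.3, 0.2]   # illustrative p.m.f. on {0, 1, 2}
n = 4
cov = multinomial_covariance(theta, n, 0, 1)
print(cov, -n * theta[0] * theta[1])
```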

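The logarithmic moment generating function of $\sum_{i\in I} a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}$ satisfies

$$\log P_{n,h}\exp\left(\sum_{i\in I}\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right) \le \sum_{i\in I}\log P_{n,h}\exp\left(\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right)$$

where $I = J$ or $J^c$. Each $N_i$ is binomially distributed with parameters $n$ and $\theta(i)$.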
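For $i \in J$, $a_i^* \ge 0$ so that for $\lambda < 0$:

$$\sum_{i\in J}\log P_{n,h}\exp\left(\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right) \le \frac{\lambda^2}{2}\sum_{i\in J} a_i^{*2}\,\frac{\theta(i)}{\theta_0(i)}.$$

Note that

$$\sum_{i\in J} a_i^{*2}\,\frac{\theta(i)}{\theta_0(i)} \le \sum_{i\in J} a_i^{*2} + \sum_{i\in J} a_i^{*2}\,\frac{h(i)}{\sqrt{n}\,\theta_0(i)} \le 1 + \frac{\sigma_n(h)}{\sqrt{n\inf_{i\le k_n}\theta_0(i)}}.$$

Hence

$$\inf_{\lambda<0}\exp\left(\log P_{n,h}\left[\exp\left(\sum_{i\in J}\lambda a_i^*\,\frac{N_i - n\theta(i)}{\sqrt{n\theta_0(i)}}\right)\right] + \frac{\lambda}{2}(\sigma_n(h) - s_n)\right) \le \exp\left(-\frac{(\sigma_n(h) - s_n)^2}{8\Big(1 + \frac{\sigma_n(h)}{\sqrt{n\inf_{i\le k_n}\theta_0(i)}}\Big)}\right).$$

If $\sigma_n(h) \ge 2s_n$, $\sigma_n(h) - s_n \ge \sigma_n(h)/2$ and the last term may be upper-bounded
by $\exp\big(-\frac{1}{64}\big[\sigma_n^2(h) \wedge \sigma_n(h)\sqrt{n\inf_{i\le k_n}\theta_0(i)}\big]\big)$.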

Now, if $i \in J^c$, $h(i) < 0$ so that $h(i) \ge -\sqrt{n}\,\theta_0(i)$ which entails $-a_i^*/\sqrt{n\theta_0(i)} \le
1/\sigma_n(h)$. For any $\lambda > 0$






exp — AaJ 



N, - n0{i) 



< 



< A 



2^o(*)(l-^) 

/^ a„{h) 



^ninfi<fc„ eo(i) 



2(1-^) 



Hence 



inf exp logP„,/j 

A>0 \ 



exp - ^ Ae 



iej'^ 



y/n9o{i) 



- 2 (0'n(/l) - S„) 



< exp 



(cr„(/7.) - SnY 



<yAh) 
If $\sigma_n(h) \ge 2s_n$, $\sigma_n(h) - s_n \ge \sigma_n(h)/2$ and the right-hand-side is upper-bounded
by

^o•,^l(/^) A A/ninfi<fc„ 9o{i)(Jn{h) ^ 
exp 